THE SECOND INTERNATIONAL CONFERENCE
ON  VECTOR  AND  PARALLEL  COMPUTING 


Introduction 

This 5-day conference was organized principally by
the IBM Bergen Scientific Center with the participation
of the Society for Industrial and Applied Mathematics
and the Association for Computing Machinery. Held in
Bergen, Norway, the conference was attended by 450
participants coming mainly from Western Europe and the
United States.

The program included invited speakers, contributed
papers, and student scholarship papers. My report
includes summaries of the lectures given by the invited
speakers based on notes taken during the conference and
the authors' abstracts of the contributed and student
scholarship papers. These were given in parallel sessions,
only half of which I could attend.

Invited  Speakers’  Presentations 

"The  Grand  Challenge  of  Supercomputing" 

Alan Weis, keynote speaker, IBM Data Systems Division, US, Vice
President of Engineering and Scientific Computing.

There is an ever-changing role for supercomputers
in the modern scientific community. One important
challenge is that of accelerating the absorption of
supercomputers into the broader community of busy users, as
contrasted with computer specialists.

The predominant tools of investigation in experimental
science of the past have become too costly to use. The
need is increasing for more numerical approaches to
solving these problems. Hence there is a growing need for
supercomputers in such areas as aircraft design, weather
forecasting, and exploration in the oil industry.

Today the mainstream supercomputer, viewed by the
busy user, is expensive, hard to use, experimental, and
limited to highly specialized users. The busy user of
computers simply wants to solve problems in science and
mathematics without the need to become a super specialist
in computer architecture. The need for the mainstream
user of the future is a system which is cost effective,
reliable, easy to use, rich in software for applications, and
robust in the data management.

Why are users moving toward the supercomputer? It
is becoming widely recognized by government, industry,
and universities as a useful and more and more available
tool for solving problems requiring lots of data and very
fast computation. The initiatives of the NSF in the US
along with various European initiatives are helping to
bring this about. New applications requiring extensive
computation like the use of fractals for various problems
in physics and engineering and the use of finite elements
in fuselage design in the aeronautics industry are
bringing about more and more demand for supercomputers.
Also, manufacturers are producing more powerful and
more robust systems than ever before. And this
technology is being put into the hands of users.

Dr. Blackburn is the London representative for the Commerce
Department for industrial assessment in computer science and
telecommunications.

The technology transfer of the supercomputer is
through the academic and research scientists, industrial
research scientists, and the industrial and commercial end
user. This move to supercomputers is happening now
because the environment for their use is better understood
and the requirements for the applications are being
better addressed. The necessary components for further
extending the use of supercomputers are technological
advances and architectural developments.

The relevant technological advances in logic and
memory are related to miniaturization, price reduction,
tools, and facilities. Significant developments relating to
miniaturization include the scanning tunneling
microscope, better understanding of materials at the atomic
level, and understanding the silicon surface. Improved
processing tools include E-beam and x-ray lithography.
Another important development is that of advanced
facilities like super-clean rooms in which to produce
components.

Important advances in field-effect transistors will
lead to one million transistors per chip. Of perhaps
equal importance are the developments in packaging,
which require an in-depth understanding of materials.
Precision in disk storage devices now permits read/write
heads to be 10 millionths of an inch above the disk
surface. This disk surface must, of course, be totally flat.
Optical storage has been developed to the point where
laser-written pits can be at a spacing of five microns.

Developments in system architecture include higher
performance in both scalar and vector processors, more
parallelism for both general purpose and special purpose
computers, larger memories, and much higher bandwidth
for communication with input and output subsystems.

More mature system software is now available to
permit interactive computing with systems having very large
virtual memories. Management control systems are
greatly improved and computer technology for
vectorization and parallelism is now becoming available. Data
management systems now offer greater reliability and
faster access. Network connectivity is evolving toward
standard protocols like OSI. Workstations are more
thoroughly integrated into systems allowing cooperative
processing.

This year, using X-windows on a vector computer,
scientists were able to observe and steer processes and
"fly" over a silicon chip for more information on chemical
bonds through changing the current applied to the
material.

Network access and interconnection is featuring high
bandwidth, management for service through programs
and data libraries, and migration to OSI standards.
Current networks in the US, Europe, and Japan will shortly
feature speeds up to 4.5 Mb/sec. They allow expansion, as
in the US NSF network connection to Europe and Japan.

Present-day networks do not do a very good job of
network management. NSF will concentrate on network
management and the use of program libraries to help in
cross-discipline use.

Traditionally the architecture of a computer system
influenced the algorithms used with the system in
applications. The trend now is toward the algorithm
influencing the architecture.

Among the grand challenges remaining are: getting
the supercomputer to be more widely used by the busy
user, the integration of CAD/CAE/CAM, and use in
computational chemistry. Government, universities, and
industry need to work together to apply the required
interdisciplinary resources to meet these challenges.
And they must work together to ensure that educational
needs are adequately met.

"Supercomputing  as  a  Tool  for  Product 
Development" 

llan Erisman, Boeing Computer Services.

As Richard Hamming said in 1962, "The purpose of
computing is insight, not numbers." However, in much of
today's computing the numbers are very important. The
computing requirements of such areas have data
processing characteristics that have long differed from scientific
computing requirements where data is analyzed as
opposed to processed. FORTRAN, supercomputers,
minicomputers, and workstations dominate today's scientific
computing; fourth-generation languages, mainframes,
and personal computers dominate data processing.

Because of this background, supercomputers are
associated with scientific analysis; the use of today's vector
and parallel computers requires that code be adapted to
the architectural environment to use the central
processing units effectively. The proliferation of so-called
near-supercomputers has made vector and parallel
architecture available to a broader group of engineers
and scientists.

The use of these analysis results in industry requires
that they be accessible to the other part of the computing
world where products are made. Manufacturing
considerations place design constraints on the products to be
built, and this changes the models which must be
analyzed. In order to produce a product at low cost, the
analysis alternatives need to include pricing data.

The separate worlds of scientific computing and data
processing will have to come together. Users of the
sophisticated analyses made possible by the supercomputer know
too little about CAD/CAM, and vice versa.

Supercomputers are powerful engines which provide
opportunities to solve problems previously intractable.
Computer systems provide powerful computing
hardware, operating systems, compilers, and languages.
Scientific computing provides algorithms, applications
programs, and a computing environment with, for
example, data management.

Some successes due to supercomputers in science
have occurred in understanding molecular structure and
fluid flow under various conditions. Successes in
products include the design of the Boeing 737-300, the Ford
Taurus, and enhanced oil recovery. The benefits relate
to product performance rather than to technical
accomplishments. The supercomputer is more than a research
and development tool.

The supercomputer impact on the design of the
Boeing 737-300 had to do with the engine placement in
relation to the total aerodynamics of the entire airplane.
In products the supercomputer may benefit a design by
showing that a small change in design may produce a large
performance difference. This may result in significant
economic benefits.

We need new supercomputer models for dealing
directly with product development. There is a close link
between research and product development, and also a
greater need for the supercomputer to be used more
frequently by the so-called busy user who doesn't have time
to become a computer expert. We need to deal with all
the available or obtainable data.

There is a potential for computers to do more in
product design. The requirement is for ever-more-powerful
supercomputers. This presents an opportunity and a
challenge to the computer manufacturer. New modeling
approaches are possible through better analysis and
design optimization.

Parallelism  in  computers  poses  a  tough  challenge. 
There  is  very  little  software  experience  with  parallelism. 

Libraries have an important role in supercomputing.
The performance of a vector computer is closely tied up
with the application, e.g., dependent on the amount of
natural parallelism in the application program. Highly
tuned libraries of application programs can have an
important impact on performance. In a parallel computer
not only are the architecture and the algorithm important
to performance but also the application.

Modeling involves design optimization: integrated
analysis including structural design, control systems,
thermal design, and aerodynamics design. It may also include
artificial intelligence and symbolic computing.
Manufacturability issues must be considered in modeling and
design. The total analysis process needs to be integrated
into one application.

At present the supercomputer is involved in
performance analysis but not in CAD/CAM. Thus, the
performance analysis is not carried through as it should be.
There is a serious need for proper integration of the whole
process.

Integration in the computer sense involves the
supercomputer, the workstations, graphics capability, data
management, high-bandwidth communication, a
common operating system, and artificial intelligence tools.
The user at his terminal is the foreground of a properly
integrated system, and all else is background.

In summary, we need more powerful supercomputers,
better modeling and algorithm development,
integrated computer systems, better data management, and
company organization.

"On  the  Suprenum  System" 

Ulrich Trottenberg, GMD, West Germany.

The Suprenum is a supercomputer for numerical
applications. The hardware is a highly parallel MIMD
architecture in which nodes of processors have vector
units and local memory. It was not considered to be a
good approach to design and optimize for a particular
algorithm, to design sequentially for a conventional
computer, or to design special purpose computation for
certain large-scale applications on the basis of
old-fashioned algorithms. The idea was to design a system for
general large-scale applications.

Suprenum 1 is scheduled for completion at the end
of 1989 with a planned performance of 4 Gflops.
Suprenum 2 is a research project but will be a product at a later
date.

The node of a Suprenum will consist of 16 worker
processors (MC 68020) each with a floating point vector
unit (Weitek) rated at 16 Mflops and local memory. A
cluster will consist of 16 nodes for a total of 256
processors. A high-performance system would consist of 4x4
clusters extendable to 16x16 clusters.

Although a 256-processor system would have a
theoretical maximum performance of 4 Gflops, a more
realistic actual performance is likely to be 1-2 Gflops.
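
As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. Only the processor count, the 16-Mflops rating, and the 1-2 Gflops estimate come from the talk; the function and variable names are illustrative assumptions, and 1 Gflops is taken as 1000 Mflops.

```python
# Figures quoted in the talk; names are illustrative, not Suprenum terminology.
PROCESSORS = 256              # processors in the full system
MFLOPS_PER_VECTOR_UNIT = 16   # rated peak of each vector unit


def peak_gflops(processors, mflops_each):
    """Theoretical peak in Gflops, taking 1 Gflops = 1000 Mflops."""
    return processors * mflops_each / 1000.0


peak = peak_gflops(PROCESSORS, MFLOPS_PER_VECTOR_UNIT)
print(f"theoretical peak: {peak:.3f} Gflops")

# The 1-2 Gflops "realistic" range expressed as a share of the peak.
for sustained in (1.0, 2.0):
    print(f"{sustained:.0f} Gflops sustained = {sustained / peak:.0%} of peak")
```

On these numbers the quoted realistic range corresponds to roughly a quarter to a half of the theoretical peak.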

The system is architecturally a compromise between
full connectivity of all processors in the system and
strictly local connectivity. The architecture is a two-level
bus-coupled architecture.

An abstract Suprenum machine exists which allows
hardware-independent programing. There is also a
Suprenum FORTRAN and a concurrent Modula-2. The
Suprenum FORTRAN is an extended FORTRAN 77
with process handling, message passing, and array
operation capability. Also, the concurrent Modula-2 is an
extension of the original Modula-2. There will also be a
communication library for Grid applications. An
important tool is a dynamic map which gives a picture of all
processors at a given instant.


A basic numerical library of applications is planned
which will cover linear algebra, multigrid solvers for
partial differential equations, and an ordinary differential
equations package. Another package will provide full
potential equation solutions for subsonic and transonic
flow.

There was an 8-node version of Suprenum running as
of April 1987. A 32-node version is expected to be
operational in the first quarter of 1989 and will be
demonstrated at the Hannover Fair, 5 April 1989. The system is
to be manufactured for marketing by 1990.

"IBM Supercomputing Trends and Directions"

Alec Grimison, IBM.

The base for supercomputers in IBM is the
3090-600E to which can be added up to six vector processors.
The 3090 is an outstanding scalar processor. Its vector
capacity is optimized for applications that are 60-70
percent vector. The system has a large memory, and
expanded storage ranging from 256 Mb to 2 Gb. It has an excellent
aggregation of input/output equipment and is a very high
performance system.
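
The weight placed on the vector fraction of an application can be illustrated with Amdahl's law, which was not part of the talk but is the standard way to reason about such figures. The sketch below is a minimal illustration; the factor of 10 for the speed of vector code relative to scalar code is a hypothetical value chosen for the example, not an IBM figure.

```python
def overall_speedup(vector_fraction, vector_speedup):
    """Amdahl's law: only `vector_fraction` of the run time is accelerated."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


ASSUMED_VECTOR_SPEEDUP = 10.0   # hypothetical, not taken from the talk

for fraction in (0.6, 0.7, 0.9):
    s = overall_speedup(fraction, ASSUMED_VECTOR_SPEEDUP)
    print(f"{fraction:.0%} vector content -> {s:.2f}x overall")
```

Under this assumption a code that is 60-70 percent vector gains only a factor of two to three overall, which is consistent with pairing the vector facility with a strong scalar processor.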

Its virtual store-extended architecture software
MVS/ESA has 16 Tb addressable. It also supports
VM/XA and AIX/370. The VS FORTRAN vectorizing
compiler and parallel FORTRAN for single-job
turnaround are available on the 3090-600E. There are more
than 40 application packages available with the system.

The critical elements in IBM's supercomputer
strategy are:

• A balance between scalar, vector, and parallel
processing

•  Large  memories  and  input/output  capability  to  match 
the  system’s  computing  power 

•  Compatibility  with  existing  systems 

•  High  use  by  balanced  capability  between  the  need  for 
throughput  and  turnaround 

•  Moderate  parallelism  with  shared  global  memory. 

The  strategy  includes  firm  coupling: 

•  Connection  between  IBM  3090  complexes  to  provide 
very-large-scale  scientific  computing 

•  Careful  balancing  of  hardware,  software,  and  systems 
requirement 

•  High  performance. 

The current hardware allows multiple 4.5-Mb/s
channels. The need for higher data rate is foreseen and
being explored. IBM is an active participant in the
standards group, ANSI, which is drafting standards for
high-speed connection interfaces for up to 100 Mb/s.

Multiple 3090 complexes do not have shared
memory in the S/370 sense. Parallel FORTRAN currently has
shared and private common areas. Extensions to
parallel FORTRAN will make designated common blocks of
memory appear globally shared. The effect on
performance is being explored.

An applications set of programs is being analyzed to
handle situations between the extremes of obviously
parallel and nonparallel subsets. Also, the balance of
communication to computation is under study.
There are ongoing studies at Cornell University and IBM
Yorktown Heights Research Center.

A hierarchical approach to processing will allow
decomposing a problem among complexes with
computation time in excess of intercluster communications.
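
The requirement that computation time exceed communication time can be made concrete with a simple cost model. The sketch below is an assumption-laden illustration, not IBM's analysis: it supposes the work divides evenly among the complexes and that every step adds a fixed communication cost. All names and numbers are hypothetical.

```python
def parallel_time(total_compute_s, complexes, comm_s_per_step, steps):
    """Run time when work splits evenly and each step pays a fixed
    communication cost between complexes."""
    return total_compute_s / complexes + comm_s_per_step * steps


def efficiency(total_compute_s, complexes, comm_s_per_step, steps):
    """Fraction of the ideal speedup actually achieved."""
    t = parallel_time(total_compute_s, complexes, comm_s_per_step, steps)
    return total_compute_s / (complexes * t)


# Hypothetical job: 1000 s of computation, 100 steps, 4 complexes.
for comm in (0.1, 5.0):
    e = efficiency(1000.0, 4, comm, 100)
    print(f"{comm} s communication per step -> efficiency {e:.2f}")
```

In this model the decomposition pays off only while the per-complex computation (250 s here) stays well above the accumulated communication time.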

The subject of visualization in scientific computing
needs to be defined, and the extent of need should be
determined. It may be defined as interactive display and
processing of data from an ongoing supercomputer
computation. The need is to steer simulations in
computationally real time. A further need is for higher
resolution color graphics, and much higher communication
bandwidth.

Further along we will evolve very high-end extended
systems architecture ESA/370 for large-scale computer
systems. Such a system will build on the commercial
hardware/software base which will be extended. Super
scalar performance characteristics will be exploited. A
system balance will be maintained between vector, scalar,
parallelism, memory, and input/output.

IBM is wedded to the future of parallelism. We shall
optimize to high vector/parallel content as applications
evolve. We shall optimize to user demands such as
memory bandwidth for technical computing. The ESA/370
hardware and software technology will be driven to its
limit. There will be some fallout for use in the
commercial line of computers.

IBM intends to be a major participant in providing
supercomputer solutions. We have the appropriate base
to allow for rapid enhancement. We will continue to
emphasize the balanced systems approach.

"Parallel  Logic  Programing" 

David Warren, Manchester University, UK.

Japan's Fifth-Generation project has helped to
highlight logic programing as a unifying framework on which
to build advanced computer systems, particularly for
non-numeric applications. One of the key features of logic
programing as a framework is that it provides a powerful
model of computation which lends itself to parallel
implementation.

There are two main kinds of parallelism in logic
programs: or-parallelism and and-parallelism.
Or-parallelism enables alternative solutions to a query to be found in
parallel. And-parallelism enables different steps toward
a single solution to a query to be performed in parallel.
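
The distinction can be illustrated outside Prolog with a small Python sketch. It is only an analogy under stated assumptions: the fact table, the function names, and the use of a thread pool are all hypothetical, and the second function shows only independent and-parallelism (subgoals that share no unbound variables), not the dependent form.

```python
from concurrent.futures import ThreadPoolExecutor

# A tiny fact base: (parent, child) pairs. Purely illustrative data.
PARENT = {("ann", "bob"), ("ann", "cat"), ("bob", "dan"), ("cat", "eve")}


def children(person):
    """All solutions X of the query parent(person, X)."""
    return sorted(c for p, c in PARENT if p == person)


def grandchildren_or_parallel(person):
    """Or-parallelism: each child opens an alternative branch of the
    search, and the branches are explored concurrently."""
    with ThreadPoolExecutor() as pool:
        branches = list(pool.map(children, children(person)))
    return sorted(g for branch in branches for g in branch)


def all_goals_and_parallel(goals):
    """And-parallelism: independent subgoals of one solution are checked
    concurrently, and every one of them must succeed."""
    with ThreadPoolExecutor() as pool:
        return all(list(pool.map(lambda goal: goal in PARENT, goals)))


print(grandchildren_or_parallel("ann"))
print(all_goals_and_parallel([("ann", "bob"), ("bob", "dan")]))
```

In the first function the alternatives are independent of one another, which is why or-parallelism can be exploited without changing the program; the second function has to combine results because all steps contribute to one answer.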

The performance of computers running logic
programs is measured in logic inferences per second (LIPS).
Manchester with the support of the Science and
Engineering Research Council (SERC) is working with MCC,
Austin, Texas, toward a powerful system.


The project is motivated by the cost effectiveness of
using multiprocessors and the difficulties in programing
for them. There will be wide acceptance of
multiprocessing only when the systems are regarded by programers as
a "black box."

Prolog was chosen because it is adequate for real
applications; it is widely known and used and is a well
adapted technology. Prolog can be considered a
generalization of a functional language like pure LISP.

The Japanese institute ICOT is using dependent
and-parallelism in languages like Prolog and Concurrent
Prolog. We at Manchester are working with or-parallelism,
as in the SRI model. Our aim is to run real applications
faster without changing the program.

Our system is called the Aurora system and is
constituted of SICStus Prolog plus the SRI model plus a
scheduler. Aurora is a full Prolog system with good speed and
speedups.

In the future we expect to explore new applications
and the scheduling of speculative work and to
incorporate dependent and-parallelism in an Andorra model
which will combine and- and or-parallelism.

We want processors to share data rather than
memory. Each datum will be identified by a virtual address.
Virtual addresses can be mapped quite flexibly onto
physical addresses. There may be multiple copies of a
particular datum and the physical location of a datum will
be transparent to the user. Data will simply migrate to
where it is needed.

A communications controller will handle local and
nonlocal memory accesses. Local communication will
allow reading a local datum or writing an unshared
datum. Nonlocal communication will allow reading a
nonlocal datum, broadcasting to the nearest copy, and
marking as required.

A  data  diffusion  machine  is  characterized  by  shared 
virtual  memory  but  not  shared  physical  memory.  It  is 
scalable  and  data  migrates  automatically  to  minimize 
remote  access. 

"Domain  Decomposition  Algorithms  and  Applica¬ 
tions  to  Fluid  Dynamics" 

Tony Chan, UCLA.

Domain decomposition is a class of methods for
solving mathematical physics problems by decomposing the
physical domain into smaller subdomains and obtaining
the solution by solving smaller problems on these
subdomains. Motivation for this approach may be:

•  The  ability  to  use  different  mathematical  models  and 
approximation  methods  in  different  subdomains 

• Use of fast, direct methods in subdomains

•  Memory  limitations  of  the  computer 

•  Suitability  for  implementation  on  parallel  computers. 

Applications can be found in many areas of scientific
computing, such as computational fluid dynamics and
structural mechanics. The key ingredient in many of
these methods is the system of equations governing the
variables on the interfaces between the subdomains which
is often solved by preconditioned iterative methods.

One question that arises is whether or not to have
overlapping subdomains. In the view of Professor Chan
it doesn't seem to make much difference.

In solving partial differential equations on a domain
using either finite differences or finite elements the domain
is decomposed into subdomains, on each of which the
solution is simpler than on the entire domain. The
subproblems are solved on the subdomains and these solutions
are pieced together to arrive at a global solution on the
entire domain.

The piecing together involves solving the system of
equations governing the variables on the interfaces
between the subdomains. One approach is to estimate the
solution on the interior boundary and carry out successive
iterations until the equations are satisfied to an
acceptable level. Often a preconditioned iterative method is
used, such as the Preconditioned Conjugate Gradient method.
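
A minimal sketch of a preconditioned conjugate gradient iteration is given below for readers unfamiliar with it. It is not Professor Chan's interface solver: the diagonal (Jacobi) preconditioner, the dense list-of-lists matrix, and the 3x3 test system are all assumptions chosen to keep the example short, and the matrix is assumed symmetric positive definite.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=200):
    """Conjugate gradient with a Jacobi (diagonal) preconditioner.
    Assumes A is symmetric positive definite. Returns (x, iterations)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    z = [r[i] / A[i][i] for i in range(n)]       # apply inverse of diag(A)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small illustrative system (hypothetical data).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterations = pcg(A, b)
print(x, iterations)
```

In a domain decomposition setting the matrix would be the interface system and the preconditioner something far better adapted to it than the diagonal used here.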

"Domain  Decomposition  Methods  for  Parallel  Com¬ 
puter" 

Gerard Meurant, Centre d'Etudes de Limeil-Valenton, France

Domain Decomposition methods were originally
developed to solve large problems on computers with small
memory or to decompose problems on complex
geometries, allowing fast methods to be used on the subdomains.
Today, these methods have become interesting for use
with parallel computers, mainly of the MIMD type.

The oldest of the domain decomposition methods is
the Schwarz Alternating Method with overlapping
subdomains. Some new results both on convergence and
applications of this method have recently been found. The
other methods are mainly related to the Conjugate
Gradient Method, as ways to derive efficient preconditioners.
The methods are classified by the kind of problems they
can solve or by the type of parallel computer to which they
are best adapted.

Some methods rely on knowledge about the
underlying partial differential equation, use direct solvers on
the subdomains, and so are targeted to parallel
computers with a very large number of processors. Some others
are purely algebraic methods and use approximate
solvers on the subdomains, and hence are more suitable for
computers with small numbers of powerful vector
processors.

The procedure in any case is to split the problem into
pieces, solve the pieces in parallel on the subdomains, and
then put the pieces together to get the global solution.

Domain decomposition methods differ in several
ways:

• The method of partitioning may be with or without
overlapping of subdomains and the subdomains may
be stripes or boxes over the domain.

• The method of solution on the subdomain may be
exact, approximate, or an exact solution of an
approximate problem.

• The method of construction of the problem for the
interfaces may be from the partial differential equation
or algebraically from the matrix of coefficients.

Meurant's group has chosen a target architecture
using stripes and a supercomputer with a few very
powerful processors using shared memory. With the use of
stripes as subdomains they will be looking for long
vectors.

The Schwarz method of solution uses a block
Gauss-Seidel method applied to the matrix of coefficients. A
large number of iterations is required when there is little
overlapping of subdomains but the number of iterations
drops dramatically for large overlapping.
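
The dependence of the iteration count on the overlap can be reproduced on a one-dimensional model problem. The sketch below is an illustration under assumptions of my own choosing, not Meurant's code: it solves -u'' = 1 on (0, 1) with zero boundary values by finite differences, uses two overlapping subdomains, and solves each subdomain exactly with the Thomas algorithm. The parameter `overlap` counts the grid points shared by the two subdomains.

```python
def solve_subdomain(u, lo, hi, rhs):
    """Solve -u[i-1] + 2*u[i] - u[i+1] = rhs exactly for i = lo..hi
    (Thomas algorithm), with u[lo-1] and u[hi+1] held fixed."""
    m = hi - lo + 1
    d = [rhs] * m
    d[0] += u[lo - 1]
    d[-1] += u[hi + 1]
    c = [0.0] * m                  # modified super-diagonal
    c[0] = -0.5
    d[0] /= 2.0
    for i in range(1, m):          # forward elimination
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (d[i] + d[i - 1]) / denom
    for i in range(m - 2, -1, -1):  # back substitution
        d[i] -= c[i] * d[i + 1]
    u[lo:hi + 1] = d


def schwarz(n, overlap, tol=1e-8, max_iter=100000):
    """Alternating Schwarz on n interior points with two subdomains.
    Returns the grid function (including boundary zeros) and the
    number of sweeps needed for the update to fall below tol."""
    h = 1.0 / (n + 1)
    u = [0.0] * (n + 2)
    left = n // 2 + 1              # first point of subdomain 2
    right = n // 2 + overlap       # last point of subdomain 1
    for iteration in range(1, max_iter + 1):
        old = u[:]
        solve_subdomain(u, 1, right, h * h)   # uses current u[right + 1]
        solve_subdomain(u, left, n, h * h)    # uses updated u[left - 1]
        if max(abs(a - b) for a, b in zip(u, old)) < tol:
            return u, iteration
    return u, max_iter


for overlap in (1, 5, 20):
    _, sweeps = schwarz(99, overlap)
    print(f"overlap of {overlap} points: {sweeps} sweeps")
```

Because each sweep updates the second subdomain with values just computed on the first, this is the block Gauss-Seidel form of the method described above.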

Other  methods  of  solution  are  to  use  the  Conjugate 
Gradient  method  to  accelerate  the  convergence  process 
on  the  interface,  or  to  use  the  Block  Jacobi  method  with 
overlapped  subdomains. 

The  Domain  Decomposition  method  is  especially 
well  suited  to  parallel  supercomputers. 

"Comparison  of  Super  and  Mini-Super  Computers 
for Computational Fluid Dynamics Calculations"

Wolfgang Gentzsch, Fachhochschule Regensburg, West Germany.

A model benchmark has been developed to estimate
the performance of supercomputers for engineering and
scientific applications. It consists of four parts: special
kernels, basic linear algebra routines, iterative solvers for
systems of equations, and application programs.

By  using  about  100  kernels  and  basic  linear  algebra 
routines  it  is  possible  to: 

•  Test  the  capability  and  the  limits  of  the  vectorizing  and 
parallelizing  compiler 

• Estimate the performance for basic operations
depending on vector length.
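
One common way to summarize how the speed of a basic operation depends on vector length is Hockney's two-parameter model. It is offered here only as an illustration and was not named in the talk; the asymptotic rate and half-performance length used below are hypothetical values.

```python
def hockney_rate(n, r_inf, n_half):
    """Hockney's model: achieved rate for vector length n, where r_inf is
    the asymptotic rate and n_half the length giving half of r_inf."""
    return r_inf * n / (n + n_half)


R_INF_MFLOPS = 100.0   # hypothetical asymptotic rate
N_HALF = 50            # hypothetical half-performance vector length

for n in (10, 50, 500, 5000):
    rate = hockney_rate(n, R_INF_MFLOPS, N_HALF)
    print(f"vector length {n:5d}: {rate:6.1f} Mflops")
```

Fitting the two parameters to kernel timings at several vector lengths gives a compact description of a machine's behavior on short and long vectors.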

Twenty-five variants of different linear algebraic
systems solvers are used to:

•  Study  restructuring  with  respect  to  the  special  vector 
and  parallel  architecture 

• Derive basic vectorization and parallelization rules
for numerical algorithms.

In addition, five production codes from plasma
physics, Euler and Navier-Stokes Flow, Grid Generation,
and Multigrid have been included to get better insight
into more complicated constructs within more complex
programs.

The solution of equations like the Navier-Stokes
equations, involving the conservation of mass, momentum,
and energy, involves a number of steps:

•  Mesh  generation 

• Discretization of the partial differential equations

• Determining a starting solution

•  Modeling  of  turbulence 

•  l''  :!!:ding  stability  and  convergence 

• Applying artificial viscosity

and  may  include 

• The use of computer graphics.

Such problems involve a very large number of
unknowns, as many as 30 to 40 at every grid point. Thus, a
30x30 grid with 40 unknowns per point would involve
36,000 unknowns.
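
The count follows directly from multiplying grid points by unknowns per point, as the short sketch below shows. The function name is an illustrative assumption; the grid size and the range of 30 to 40 unknowns are from the talk.

```python
def unknown_count(nx, ny, unknowns_per_point):
    """Total unknowns on an nx-by-ny grid."""
    return nx * ny * unknowns_per_point


for per_point in (30, 40):
    total = unknown_count(30, 30, per_point)
    print(f"30x30 grid, {per_point} unknowns per point: {total}")
```

The figure of 36,000 corresponds to the upper end of the quoted range; at 30 unknowns per point the same grid carries 27,000.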

The results of benchmarking of such problems
depend on the machine architecture, the compiler used, and
the algorithms. Thus, benchmarks can be misused; the
choice (or omission) of optimal kernels can influence the
average performance reported for a machine. Often
decisions are made on benchmark results which are not
entirely indicative of performance.

"Images of Matrices"

Cleve Moler, Ardent Computers, US.

That supercomputers have been proven effective for
computation but have not yet been proven for
visualization is a paradox. Mathematical visualization means the
use of powerful graphics software and hardware to
investigate mathematical computations.

The  underlying  tools  for  mathematical  visualization 
include: 

• Titan, Ardent's new graphics supercomputer with
super parallel architecture with various levels of
parallelism including pipelining and vector
parallelism

• Software, consisting of four groups: an operating
system; compilers for various languages; graphics
(Ardent's Dynamic Object Rendering Environment
(DORE)); and science software, e.g., MathWorks'
Matrix Laboratory (MATLAB).

DORE  provides  a  connection  between  computing 
and  graphics.  It  takes  geometric  components  to  produce 
realistic  objects. 

The  classic  MATLAB  was  developed  by  Moler  at 
Argonne National Laboratory and Stanford University.
The  commercial  version  of  MATLAB  is  written  in  the 
language  C  and  is  useful  in  two-  and  three-dimensional 
graphics.  It  is  also  extensible. 

Examples of the use of MATLAB and DORE include:

• A dynamic portrait of a vibrating L-shaped membrane

• A view of matrix decomposition algorithms

• Surfaces defined by mapping of the complex plane

• Solutions of some model partial differential equations.


"Parallel  Integration  of  Vision  Models" 

James Little, Artificial Intelligence Laboratory, MIT.

Computer vision has developed algorithms for several early
vision processes - such as edge detection, stereopsis,
motion, texture, and color - that give separate cues to the
distance from the viewer of three-dimensional surfaces,
their shape, and their material properties. Yet, and not
surprisingly, biological vision systems still greatly
outperform computer vision programs. It is increasingly
clear that one of the keys to the reliability, flexibility,
and robustness of biological vision systems is their ability
to integrate the different visual cues. We have developed a
technique to integrate different visual cues, and have
implemented it with encouraging results on a parallel
supercomputer.

Whereas it is reasonable that combining the evidence
provided by multiple cues - for example, edge detection,
stereo, and color - should provide a more reliable map of
the surfaces than any single cue alone, it is not obvious
how this integration can be accomplished. One of the most
important constraints for recovering surface properties
from each of the individual cues is that the physical
processes underlying image formation, such as depth and
orientation and reflectance of the surfaces, are typically
smooth. Standard regularization (Poggio and Torre, 1984),
on which many examples of early parallel vision algorithms
are based, captures this smoothness property well.

The  physical  properties  of  surfaces,  however,  are 
smooth  almost  everywhere,  but  not  at  discontinuities. 
Reliable  detection  of  discontinuities  is  critical  for  a  vision 
system  since  discontinuities  are  often  the  most  important 
locations  in  a  scene.  The  idea  is  to  couple  different  cues 
to  the  image  data  (especially  intensity  edges)  through  the 
discontinuities  in  the  physical  properties  of  the  surfaces. 
The  goal  is,  of  course,  to  use  information  from  several 
cues  simultaneously  to  help  refine  the  initial  estimation  of 
surface discontinuities, which are typically noisy and
sparse. 

How can this be done with an algorithm that is
intrinsically parallel? We have chosen to use the machinery
of Markov Random Fields (MRF’s), initially suggested for
image processing by Geman and Geman (1984). We have extended
our previous work (Marroquin et al., 1987) to couple several
of the early vision modules (depth, motion, texture, and
color) to intensity edges in the image. This is a central
point in our integration scheme: intensity edges guide the
computation of discontinuities in the physical properties of
the surface, thereby coupling surface depth, surface
orientation, motion, texture, and color each to the image
intensity data and to each other.
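
As a rough illustration of how intensity edges can guide the
placement of surface discontinuities, the following
one-dimensional sketch uses a simplified energy with a binary
"line process" between neighbouring samples. It is not the
speakers' MRF implementation: the energy form, the parameter
names `lam`, `alpha`, and `edge_discount`, and the deterministic
rule for switching lines on are all assumptions made for
illustration.

```python
def line_penalty(edge, alpha, edge_discount):
    """Cost of introducing a discontinuity; cheaper at an intensity edge."""
    return alpha * (1.0 - edge_discount * edge)


def choose_lines(depth, intensity_edge, lam=1.0, alpha=0.5, edge_discount=0.8):
    """Switch a line on where the smoothness cost it removes exceeds its penalty."""
    lines = []
    for i in range(len(depth) - 1):
        smooth_cost = lam * (depth[i + 1] - depth[i]) ** 2
        penalty = line_penalty(intensity_edge[i], alpha, edge_discount)
        lines.append(1 if smooth_cost > penalty else 0)
    return lines


def energy(depth, lines, intensity_edge, lam=1.0, alpha=0.5, edge_discount=0.8):
    """Smoothness energy with line processes; smaller is better."""
    total = 0.0
    for i, line in enumerate(lines):
        smooth_cost = lam * (depth[i + 1] - depth[i]) ** 2
        penalty = line_penalty(intensity_edge[i], alpha, edge_discount)
        total += smooth_cost * (1 - line) + penalty * line
    return total


# Depth estimates with a strong step (samples 2-3) and a weak step (samples 4-5)
depth = [0.0, 0.1, 0.0, 1.0, 1.1, 1.5]
edges_off = [0, 0, 0, 0, 0]   # no intensity information
edges_on = [0, 0, 1, 0, 1]    # intensity edges observed at both steps

lines_without = choose_lines(depth, edges_off)
lines_with = choose_lines(depth, edges_on)
```

In this toy setting the weak depth step is accepted as a
discontinuity only when an intensity edge supports it, which is
the kind of coupling between cues the paragraph describes.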

We have been using the MRF machinery with appropriate prior
energies to integrate edge-intensity data with stereo,
motion, and texture information on the MIT Vision Machine
System. The system consists of a two-camera eye-head input
device and a 16K Connection Machine. All the early vision
algorithms - edge detection, stereo, motion, color, and
texture - as well as the MRF algorithm, currently run on the
Connection Machine several hundred times faster than on a
conventional machine.

At the same time, our integration algorithm achieves a
preliminary classification of the intensity edges in the
image, in terms of their physical origin. Preliminary
experiments suggest that recognition algorithms can
effectively use the output of the integration scheme
described here.

These  highly  parallel  algorithms  map  quite  naturally 
onto  an  architecture  such  as  the  Connection  Machine, 
which consists of 16K simple 1-bit processors with local
and  global  connection  capabilities.  These  algorithms  also 
map  onto  VLSI  architectures  of  fully  analog  elements  and 
mixed  analog  and  digital  components. 

"Programing  Parallel  Vision  Algorithms:  A  Dataflow 
Language  Approach" 

Linda Shapiro, University of Washington, Seattle.

Computer vision requires the processing of large volumes of
data and requires parallel architectures and algorithms to
be useful in real-time industrial applications. The INSIGHT
dataflow language was designed to allow encoding of vision
algorithms at all levels of the computer vision paradigm.
INSIGHT programs, which are relational in nature, can be
translated into a graph structure that represents an
architecture for solving a particular vision problem or a
configuration of a reconfigurable computational network.

Single-processor, general purpose computers cannot provide
the computational power required for real-time or even
reasonable-time vision tasks. At the image processing level,
parallelism has been achieved to some extent by cellular
array machines, pipeline architectures, and pyramids. In
order to deal with more complex vision problems, including
low-level, mid-level, and high-level algorithms, and to
provide a greater computational resource, massively parallel
cellular machines such as the Connection Machine and MIMD
machines like the Butterfly have been built.

In addition, a new tri-level parallel architecture providing
a large array of simple processors for image processing, a
medium-sized array of more powerful processors for mid-level
vision, and a small array of extremely powerful processors
for high-level algorithms is being developed in conjunction
with the DARPA Image Understanding Project. Since most of
these new machines are intended for defense use, they are
currently much more expensive than industry is willing to
pay for machine vision. For this reason, reconfigurable
architectures that have fewer processing elements, but can
be reconfigured to solve a variety of problems, are being
proposed.

All of these machines need a language in which vision
algorithms can be expressed. If the language reflects the
architecture of the machine, then software sharing between
installations with different machines will be impossible and
much unnecessary effort will go into developing and
redeveloping parallel algorithms.

A more desirable approach is to have a non-machine-dependent
language that can express parallel algorithms in a generic
way and can be translated to code that runs on a particular
architecture or to a configuration of a reconfigurable
architecture. This was the approach in the design of
INSIGHT, a dataflow language for programing vision
algorithms. INSIGHT can be used for expressing low-level,
mid-level, and high-level vision algorithms, and INSIGHT
programs can be translated to code that can run on a variety
of architectures.
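
To make the idea of a program as a dataflow graph more
concrete, here is a minimal sketch in Python. It does not use
INSIGHT syntax, which the report does not show; the node names,
the three toy operations, and the `run` helper are hypothetical.
Each node lists the nodes whose outputs it consumes, and a node
fires once all of its inputs are available.

```python
from graphlib import TopologicalSorter


def smooth(image):
    """Three-point moving average (end samples copied unchanged)."""
    middle = [(image[i - 1] + image[i] + image[i + 1]) / 3
              for i in range(1, len(image) - 1)]
    return [image[0]] + middle + [image[-1]]


def gradient(signal):
    """Absolute difference between neighbouring samples."""
    return [abs(signal[i + 1] - signal[i]) for i in range(len(signal) - 1)]


def threshold(grad, level=1.0):
    """Mark positions whose gradient exceeds the level."""
    return [1 if g > level else 0 for g in grad]


# Each node: (operation, names of the nodes whose outputs it consumes)
program = {
    "smoothed": (smooth, ["image"]),
    "gradient": (gradient, ["smoothed"]),
    "edges": (threshold, ["gradient"]),
}


def run(program, inputs):
    """Evaluate the graph, firing each node after its predecessors."""
    deps = {name: set(args) for name, (_, args) in program.items()}
    values = dict(inputs)
    for name in TopologicalSorter(deps).static_order():
        if name not in values:
            op, args = program[name]
            values[name] = op(*(values[a] for a in args))
    return values


result = run(program, {"image": [0, 0, 0, 9, 9, 9]})
```

Because the graph, not the host machine, defines the order of
evaluation, nodes with no mutual dependence could in principle
be assigned to different processors, which is the portability
argument made above.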

"Seismic  Wave  Propagation  and  Absorbing  Bound¬ 
ary  Conditions" 

Johnny Peterson, Bergen Scientific Centre, IBM, Norway.

When  computing  solutions  to  the  two-dimensional 
wave  equation  in  unbounded  domains  using  finite  dif¬ 
ference  discretization,  an  artificial  boundary  is  intro¬ 
duced.  A  boundary  condition  which  absorbs  all  outward 
propagating  waves  must  then  be  used.  Also,  the  finite  dif¬ 
ference  operator  must  be  replaced  at  the  boundary  with 
an  appropriate  boundary  operator. 

Approximations to boundary operators are well known which
work well for waves which are propagating towards the
boundary near normal incidence. However, problems occur in
cases where sources are far from the center of the model or
if the velocity field is not homogeneous. In such cases
results are contaminated with noise scattered back from the
artificial boundaries.

A nonlinear least squares method is proposed for determining
an absorbing boundary operator. The operator is chosen by
demanding that waves traveling within a predetermined cone
are attenuated as much as possible. The problem is solved
with a Monte Carlo-type minimization method. Storage
requirements can be reduced by reducing the grid size.
Results can be obtained from the vectorization of the
method.
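
A one-dimensional sketch may help show what an absorbing
boundary operator does. It is not the least-squares operator
proposed in the talk: it uses the classical first-order
(Mur-type) one-way condition, and the grid size, pulse shape,
and Courant number are assumptions chosen for illustration. In
one dimension with a Courant number of 1 this condition lets the
outgoing pulse leave the grid almost without reflection; the
difficulty addressed in the talk arises in two dimensions, where
waves strike the boundary obliquely.

```python
import math


def simulate(n_cells=100, n_steps=300, courant=1.0):
    """Leapfrog solution of u_tt = c^2 u_xx with first-order absorbing ends."""
    # Gaussian pulse in the middle of the grid, initially at rest
    u = [math.exp(-((i - n_cells / 2) ** 2) / 50.0) for i in range(n_cells + 1)]
    u_prev = u[:]
    k = (courant - 1.0) / (courant + 1.0)   # coefficient of the one-way condition
    c2 = courant ** 2
    for _ in range(n_steps):
        u_next = [0.0] * (n_cells + 1)
        # Interior points: standard second-order finite difference operator
        for i in range(1, n_cells):
            u_next[i] = (2 * u[i] - u_prev[i]
                         + c2 * (u[i + 1] - 2 * u[i] + u[i - 1]))
        # Boundary points: the interior operator is replaced by a boundary operator
        u_next[0] = u[1] + k * (u_next[1] - u[0])
        u_next[-1] = u[-2] + k * (u_next[-2] - u[-1])
        u_prev, u = u, u_next
    return u


early_peak = max(abs(v) for v in simulate(n_steps=10))   # pulse still on the grid
residual = max(abs(v) for v in simulate())               # after it has left
```

Comparing `residual` with `early_peak` gives a crude measure of
how much energy the artificial boundaries scatter back into the
model.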

“Large-Scale  Computing  in  Reservoir  Simulation" 

Richard Ewing, University of Wyoming.

The objective of reservoir simulation is to understand the
complex chemical, physical, and fluid flow processes
occurring in a petroleum reservoir sufficiently well to be
able to optimize the recovery of hydrocarbon. For this,
mathematical and computational models must be built capable
of predicting the performance of the reservoir under various
usable schemes. Many of the physical phenomena which govern
enhanced recovery processes have very important local
character. Therefore, the models used to simulate these
processes must be capable of resolving these critical local
features.

Mathematical models of enhanced recovery processes involve
large coupled systems of nonlinear partial differential
equations. In order to compare the results of these models
with physical measurements to assess their validity and to
make decisions based on these models, the partial
differential equations must be discretized and solved on
computers. Field-scale hydrocarbon simulations normally
involve reservoirs of large size. Uniform gridding on the
length scale of the local phenomena would involve systems of
discrete equations of such size as to make solution on even
the largest computers prohibitive. Therefore, local grid
refinement capabilities and efficient solution processes are
becoming more important in reservoir simulation as the
enhanced recovery procedures being used become more complex,
involving more localized phenomena in enormous problems.

Equations representing the miscible displacement of one
incompressible fluid by another, completely miscible with
the first, are combined and lead to equations describing
multiphase and multicomponent flow in porous media. These
can be used to simulate various production strategies in an
attempt to understand and optimize hydrocarbon recovery.

In miscible or multicomponent flow models, the convective,
hyperbolic part of the equation is a linear function of the
fluid velocity. The operator-splitting technique applied to
a variational method leads to a symmetric bilinear form. A
modified method of characteristics is used to treat the time
stepping. The discretization methods used can be considered
as the first step in a Newton linearization of the coupled
nonlinear system. The method is designed to linearize and
formally decouple the equations for a sequential solution
process. However, in cases where the nonlinearities in the
partial differential equations are strong, this
linearization process is not sufficiently accurate for the
desired application. In such cases the full Newton-Raphson
type of treatment can be used.
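
The difference between a single linearization step and a full
Newton-Raphson treatment can be seen on a scalar toy problem.
The cubic equation below is a hypothetical stand-in for a
strongly nonlinear discrete equation, not one of the reservoir
equations; the function names and starting guess are
assumptions.

```python
def newton(g, dg, u0, max_steps=20, tol=1e-12):
    """Newton-Raphson iteration for g(u) = 0, returning every iterate."""
    iterates = [u0]
    u = u0
    for _ in range(max_steps):
        step = g(u) / dg(u)      # solve the linearized equation
        u -= step
        iterates.append(u)
        if abs(step) < tol:
            break
    return iterates


def g(u):
    """Toy nonlinear residual with a root at u = 1."""
    return u ** 3 + u - 2.0


def dg(u):
    """Derivative of the residual."""
    return 3.0 * u ** 2 + 1.0


iterates = newton(g, dg, u0=2.0)
one_step = iterates[1]     # result of a single linearization
converged = iterates[-1]   # result of the full iteration
```

Starting from 2.0, one linearized step still leaves an error of
roughly 0.38, while repeating the step drives the error to
rounding level, which is why the full iteration is preferred
when nonlinearities are strong.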

“ParaScope:  A  Parallel  Programing  Environment" 

Ken Kennedy, Rice University.

Clearly, future generations of scientific supercomputers
will employ multiple independent processors. What form of
programing support software should be provided with such
machines? Existing FORTRAN programs, written for sequential
machines, are not well suited to parallel execution. If
these programs are to run efficiently on a multiprocessing
system, they must be decomposed into subproblems that can be
executed in parallel.

Although there has been substantial progress in methods for
automatic transformation of sequential programs to parallel
form, there is little evidence that these methods will make
it possible for the programer to be unconcerned about
parallelism. We must therefore assume that parallel programs
will be written by human programers in an explicit parallel
notation.

Explicit parallel programing is a challenging activity
fraught with opportunity for error. If programers are to be
productive on the next generation of machines they will need
powerful new tools to assist in the programing process. The
ParaScope project at Rice University is planned to provide
such tools in the context of an integrated programing
environment.

ParaScope is based on a sophisticated environment for
FORTRAN programing developed over the past 5 years at Rice.
In addition to the usual tools, such as editors, compilers,
and source-level debuggers, ParaScope will incorporate new
tools specifically designed for parallel programing,
including an editor that interactively reports potential
sources of inadvertent data sharing between parallel
processes, a compiler that analyzes the whole program to
produce good parallel code, a debugger that attempts to
execute parallel programs according to a schedule likely to
recreate data sharing errors, and performance visualization
tools that help the users identify run time bottlenecks in
their programs. A central theme in the design of the system
is the use of deep program analysis methods, developed for
automatic transformation systems, in the programing and
debugging tools.

"Current Directions and Future Possibilities in
Computational Fluid Dynamics"

Anthony Jameson, Princeton University.

This  paper  covered  a  wealth  of  material,  but  the  talk 
moved  too  fast  for  easy  following.  The  speaker  reviewed 
mathematical  models  suitable  for  different  flight  regimes, 
and  current  developments  in  the  design  of  algorithms  for 
their  numerical  simulation.  Estimates  of  corresponding 
computational requirements of both speed and memory
were  included,  and  the  impact  of  massively  parallel  archi¬ 
tectures  on  future  possibilities  for  numerical  simulation 
of  fluid  flows  was  assessed. 

The  whole  emphasis  was  on  computational  aerody¬ 
namics,  which  requires  identification  of  the  relevant 
physical  phenomena  and  the  formulation  of  appropriate 
models.  In  the  solution  of  such  problems  there  is  a  role 
for  mathematics  (including  numerical  analysis),  com¬ 
puter  science  (including  how  to  prove  a  program  correct), 
and aeronautical engineering (including what is the
objective).

The  specific  objective  is  to  calculate  the  flow  pattern 
of  the  air  past  the  aeroplane.  This  involves  calculating  the 
flow  past  the  aeroplane  in  different  flight  regimes  and  re¬ 
quires  interactive  information.  The  flow  pattern  will  in¬ 
volve  geometric  complexity  and  must  take  into  account 
viscous  effects. 

The steps involved in the process include:

•  The  choice  of  a  mathematical  model  (this  choice  ran¬ 
ges  from  a  Laplace  equation  for  ideal  fluid  flow  to  Na- 
vier-Stokes  equations  for  complex  cases) 

•  Analysis  of  the  model  chosen 

• Derivation of a numerical approximation to the partial
differential equations involved

• Writing of a program to solve the approximate equation

•  Validation  of  the  model  and  the  program. 

The  Euler  equations  in  aerodynamics  fall  between 
the  Laplace  equation  and  the  Navier-Stokes  in  complex¬ 
ity.  In  aerodynamic  design  there  is  normally  a  tradeoff 
between  complexity  of  the  algorithm  and  the  model.  The 
chosen algorithm may be finite differences or finite
elements. It may involve time marching or be steady state.

"Parallel  Programing  with  Ada" 

Jan Kok, Centrum voor Wiskunde en Informatica, the Netherlands.

The  language  Ada  (ANSI/MIL-STD  1815  A,  1983) 
was  primarily  designed  for  the  production  of  large  por¬ 
tions  of  readable,  modular,  portable,  and  maintainable 
software  for  real-time  applications.  In  the  programing 
area  concerned  with  this  production  the  differences  be¬ 
tween  machines,  systems,  languages  and  language  im¬ 
plementations,  experienced  when  transporting  and 
maintaining  programs,  are  a  main  cause  of  errors  in  pro¬ 
grams. 

In  order  to  provide  the  means  for  obtaining  more  re¬ 
liable  software  for  these  applications  the  US  government 
launched  a  significant  program  in  the  1970’s  with  the  re¬ 
sult  that  both  the  specification  was  given  of  a  portable  en¬ 
vironment  for  developing  and  running  software,  and  a 
high-level  language  was  defined  with  properties  that 
should  enhance  the  programing  of  reliable  and  maintain¬ 
able  software.  This  language,  Ada,  offers  standard  and 
readable  language  concepts  for  the  structuring  of  large 
programs,  for  the  specification  of  the  relationship  be¬ 
tween  different  modules  of  a  program,  for  data  abstrac¬ 
tions,  and  for  programing  distributed  computing  with 
clear  tools  for  describing  processes  and  the  communica¬ 
tion  between  these. 

In this presentation the author focused on the language as
an appropriate tool for the human user. High-level languages
have the property that the step from algorithms (formulated
with natural language or mathematical notation) to programs
in those programing languages is small and can also be done
in the reverse direction due to the readability of the code.

This property comes along with a high degree of abstraction
away from particular hardware or system characteristics,
which actually puts the burden of directly addressing the
machine possibilities on the specific compilers. In
particular for parallel programing, the intentionally
standardized languages with clear and high-level features
are rare. The necessity for high-level features is not
generally accepted, and the suitability of possible
expressive tools like those of Ada has not been extensively
investigated. Presumably many believe that such tools are
not possible with the expected diversity of parallel
architectures.

With the following description of the Ada tools for
programing parallel actions and of the possibilities to
exploit parallel architectures the author intends to bring
to a broader forum the issue of language tools for parallel
programing. This may hopefully result in feedback for
increasing the understanding about the applicability of
these tools, and also for their improvement in Ada and in
other scientific languages for which parallel programing
tools are under development.

The  author  first  reviewed  the  relevant  Ada  concepts 
that  can  be  used  for  parallel  programing,  in  particular  the 
task  concept  and  the  related  declarations  and  statements 
that  can  be  exploited. 

Next,  the  possibilities  were  discussed  for  the  sup¬ 
posed  and  efficient  mapping  of  Ada  tasks  into  existing 
and  imaginable  multiprocessor  architecture.  He  indi¬ 
cated  some  observed  disadvantages  of  particular  lan¬ 
guage  constructs  and  reported  experience  gained  in 
model  exercises  which  can  be  useful  for  the  solution  of 
numerical  problems  as  well. 

Finally,  he  discussed  the  possibilities  for  implement¬ 
ing  in  Ada  parallel  methods  and  for  developing  new  meth¬ 
ods  with  the  help  of  the  readable  Ada  features,  where  this 
development  so  far  has  been  handicapped  by  the  lack  of 
high-level  language  concepts  for  expressing  possible  al¬ 
gorithms  in  actual  programs. 

“Neural  Computing" 

John  Hertz,  NORDITA. 

Neural  computing  is  a  new  concept  in  computing.  It 
is  a  concept  biologically  motivated  and  massively  paral¬ 
lel.  It  has  implications  for  both  hardware  and  software. 
It  will  have  application  in  the  cognitive  area  including  as¬ 
sociative  memory,  recognition,  error  correction,  and  de¬ 
cision  making. 

A few key figures in the origin of neural computing are:

• McCulloch and Pitts, 1943, for a network of binary
threshold units

• F. Rosenblatt, 1960, for learning in perceptrons

• E. Caianiello, 1961

• B. Widrow, 1962.

The things that make possible further progress today are:

• Progress in very-large-scale integration

• Progress in neuroscience

• Progress in understanding the behavior of large complex
systems of interconnecting units.

In  biological  systems  cells  receive  electrical  pulses 
from  other  cells  and  each  pulse  raises  the  potential  inside 
the  cell,  depending  on  the  strength  of  the  synaptic  con¬ 
nection;  when  the  potential  in  a  cell  becomes  greater  than 
a  threshold  value  the  cell  fires  a  pulse  of  fixed  strength 
along  the  axon.  This  results  in  a  raising  or  lowering  of  the 
potential  in  a  node  of  cells. 


Formally, a neuron is a two-state system, either firing or
not firing. A neural computing system differs fundamentally
from a conventional computer:

• It is more massively parallel

• It is essentially collective - no programing of
individual cells

•  The  program  is  contained  in  the  synaptic  connections 

•  It  is  robust  against  noise  and  errors 

•  A  few  errors  can  be  tolerated. 

The  neural  computer’s  main  application  will  be  in 
cognitive  computation.  The  structure  of  the  network  is 
highly  important.  The  formulation  allows  a  new  kind  of 
biological  modeling. 

In physics terms, these are the dynamics of a spin system
with energy

$$E = -\frac{1}{2} \sum_{ij} J_{ij} S_i S_j - \sum_i h_i S_i,$$

i.e., a field $h_i + \sum_j J_{ij} S_j$ acting on $S_i$,
moving toward states of lower energy.

A random mixture of plus and minus synapses is equivalent to
a spin glass, except that $J_{ij} \neq J_{ji}$.

Many metastable configurations (firing patterns) produce
synaptic noise. To get from one pattern to another a soft
threshold is introduced - i.e., the probability of firing is
increased.
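
The spin-system picture can be tried out on a very small
network. The sketch below assumes symmetric couplings with no
self-coupling, built from one stored pattern by a Hebbian rule,
and a hard (noise-free) threshold; these choices, and all
function names, are illustrative assumptions rather than details
given in the talk. With them, each update aligns a unit with its
local field h_i + sum_j J_ij S_j and cannot raise the energy
E = -1/2 sum_ij J_ij S_i S_j - sum_i h_i S_i.

```python
def energy(J, h, S):
    """E = -1/2 sum_ij J_ij S_i S_j - sum_i h_i S_i."""
    n = len(S)
    pair = sum(J[i][j] * S[i] * S[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(h[i] * S[i] for i in range(n))


def sweep(J, h, S):
    """Update each unit in turn to align with its local field."""
    S = list(S)
    for i in range(len(S)):
        field = h[i] + sum(J[i][j] * S[j] for j in range(len(S)))
        S[i] = 1 if field >= 0 else -1
    return S


pattern = [1, -1, 1, 1, -1, -1, 1, -1]
n = len(pattern)
# Hebbian couplings storing one pattern: symmetric, zero diagonal (assumed)
J = [[0 if i == j else pattern[i] * pattern[j] for j in range(n)]
     for i in range(n)]
h = [0] * n

noisy = list(pattern)
noisy[0], noisy[3] = -noisy[0], -noisy[3]   # corrupt two units

recalled = sweep(J, h, noisy)
e_before = energy(J, h, noisy)
e_after = energy(J, h, recalled)
```

Here one sweep restores the stored pattern from the corrupted
one while lowering the energy, a toy instance of associative
memory and error correction.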

Contributed  Papers 

As  stated  in  my  introduction  to  this  report,  the  con¬ 
tributed  papers  and  student  scholarship  papers  will  be 
summarized  herein  using  the  speakers’  own  abstracts. 

"Supporting  Distributed  Matrix  Operations  on  a 
Hypercube” 

Clifford Addison et al., Chr. Michelsen Institute, Bergen, Norway.

The CMI High Level Library is a package of routines for a
message-passing multiprocessor. It was originally designed
to relieve the programer of the details of communication and
data handling, especially in numerical finite difference
computations. We are extending the library to support a
general set of distributed data structures and operations
for matrix and vector computations.

The object is to offer the programer a distributed
implementation of the abstract data types - "matrix" and
"vector" - that is as easy to use as the conventional
single-processor implementation for matrices and vectors in
terms of arrays. Actually the library will supply a choice
of distributed representation - by row, by column, by block,
dense or sparse - but the programer can choose the desired
representation on grounds of efficiency, and then ignore the
details of its implementation (or even change the choice
later if necessary).
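
As an illustration of what a "by row" distributed representation
involves, the sketch below simulates a row-distributed
matrix-vector product in ordinary Python. It is not the CMI
library's interface, which the abstract does not describe; the
function names are hypothetical, and list concatenation stands
in for the message passing that would gather the partial
results.

```python
def split_rows(matrix, n_procs):
    """Deal out contiguous blocks of rows, one block per simulated processor."""
    base, extra = divmod(len(matrix), n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        stop = start + base + (1 if p < extra else 0)
        blocks.append(matrix[start:stop])
        start = stop
    return blocks


def local_matvec(block, x):
    """Work done by one processor: its own rows times the full vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in block]


def distributed_matvec(matrix, x, n_procs):
    """Row-distributed y = A x; concatenation stands in for a gather step."""
    y = []
    for block in split_rows(matrix, n_procs):
        y.extend(local_matvec(block, x))
    return y


A = [[1, 2, 0], [0, 1, 3], [4, 0, 1], [2, 2, 2], [0, 0, 5]]
x = [1, 2, 3]
y = distributed_matvec(A, x, n_procs=2)
```

The caller of `distributed_matvec` never sees how the rows were
split, which is the kind of hiding of representation details the
abstract aims for.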

“Algorithms for a Specialized Matrix Systolic Processor"

L.C. Aleksandrov et al., Center for Informatics and Computer
Technology, Bulgaria.

The paper concerns some algorithmic aspects of a project for
developing a high-performance specialized processor for fast
matrix operations using a systolic array. The architecture
of the processor is briefly described and implementations of
different linear algebra algorithms exploiting the potential
parallelism of the system are considered. The algorithms
include solving systems of linear equations by the Jordan
and the Gauss methods, LU and QR decompositions, matrix
multiplication, and others. Despite the simplicity of the
architecture and the low technology requirements, the
implementations have the following advantages:

•  The  processor  solves  problems  with  arbitrarily  big 
sizes  (limited  only  by  the  amount  of  memory). 

• The utilization of the cells of the systolic array is very
close to one.

• Numerically stable versions of the algorithms can be
implemented.

•  The  range  of  solvable  problems  is  large  enough  to 
cover  important  application  areas. 

“The  CESAR  Processor" 

Vidar S. Andersen, Norwegian Defense Research Establishment, Norway.

CESAR is a parallel processor programable on various levels.
It may be attached to any 32-bit host computer from Norsk
Data through the standard DOMINO DMA Controller and the
MultiFunction Bus Memory. CESAR has a peak performance of
320 MFLOPS and is relatively compact, implemented on 13
printed circuit boards occupying about half a card crate. As
an example, 1K complex FFT’s are computed in 0.257
milliseconds on average.
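
The quoted FFT time can be related to the quoted peak rate with
a short calculation. The operation count used here, 5 n log2(n)
floating-point operations for a complex FFT of length n, is a
commonly used convention and an assumption on my part; the
abstract does not say how operations were counted.

```python
import math

n = 1024                  # "1K" complex FFT
time_s = 0.257e-3         # quoted average time per transform
peak_mflops = 320.0       # quoted peak performance

# Assumed operation count for a radix-2 complex FFT: 5 * n * log2(n)
flops = 5 * n * math.log2(n)
sustained_mflops = flops / time_s / 1e6
fraction_of_peak = sustained_mflops / peak_mflops
```

Under that assumption the FFT runs at a little under 200 MFLOPS,
or somewhat more than 60 percent of the stated peak.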

CESAR  is  well  suited  for  tasks  requiring  intensive 
computations  not  dependent  on  the  data  content.  Some 
signal  processing  algorithms  are  typical  examples. 

The  talk  is  intended  to  give  an  introduction  to  the 
CESAR  processor.  Hardware  modules  as  well  as  soft¬ 
ware  tools  are  described  from  an  application  programer’s 
point  of  view.  This  knowledge  is  necessary  to  understand 
the  related  poster  "Signal  processing  with  CESAR"  by  E- 
A  Herland. 

"A  Benchmark  Code  for  Multiprocessor  Vector 
Supercomputers" 

David V. Andersen, National Magnetic Fusion Energy Computer Center,
California, and Ralf Gruber and Alexandre Roy, Centre de Recherche en
Physique des Plasmas, Switzerland.

In the comparison of supercomputer performance one
preferably seeks criteria that are relevant to the intended
applications. For example, large classes of problems from
physics and other disciplines often result in very large
systems of linear equations. In this regard, a program which
solves such a system efficiently can be used as a benchmark
to make comparisons among available and prototypical
machines. We have developed the program PAMS (Parallelized
Matrix Solver) which uses vectorization and multitasking
(simultaneously) to solve a problem that arose in a 3-D
plasma physics application. For this problem the matrix
structure is tridiagonal block-banded with dense blocks.
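
Tridiagonal structure of this kind lends itself to cyclic
(odd-even) reduction, the procedure PAMS applies to the blocks.
The sketch below is a scalar analogue only: plain numbers stand
in for the dense blocks, the recursive formulation and all names
are assumptions, and nothing here is taken from the PAMS source.
At each level the even-numbered unknowns are eliminated, and
every reduced row can be formed independently of the others,
which is the source of the parallelism.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive odd-even reduction.

    a: sub-diagonal (a[0] ignored), b: diagonal,
    c: super-diagonal (c[-1] ignored), d: right-hand side.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):              # each odd row is independent
        has_next = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if has_next else 0.0))
        rc.append(-gamma * c[i + 1] if has_next else 0.0)
        rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if has_next else 0.0))

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    for i in range(0, n, 2):              # back-substitution, also independent
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


# Diagonally dominant test system with a known solution 1, 2, ..., 7
n = 7
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
x_true = [float(i + 1) for i in range(n)]
d = [(a[i] * x_true[i - 1] if i > 0 else 0.0) + b[i] * x_true[i]
     + (c[i] * x_true[i + 1] if i + 1 < n else 0.0) for i in range(n)]
x = cyclic_reduction(a, b, c, d)
```

In a block version the divisions become solves with dense
blocks, which is where vectorization would enter.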

The code employs a cyclic reduction procedure on the blocks
which allows one to obtain an algorithm that is potentially
very fast on multiprocessor vector supercomputers. The
block-banded system (with dense blocks) is also encountered
in other applications and therefore can be regarded as a
good reference problem. Results from testing PAMS on the
CRAY X-MP, CRAY-2, NEC SX-2, Fujitsu VP-200, and the CDC-205
will be presented. We also intend to present results from
ETA-10 tests if we can gain access to the prototype machine.
The value of PAMS as a benchmark for future more massively
parallel computers will be discussed.

"Divide-and-Conquer Algorithms for the
Computation of the SVD of Bidiagonal Matrices"

Dr. Peter Arbenz, Institut für Informatik, Switzerland.

Recently the divide-and-conquer algorithm proposed by Cuppen
for the computation of the spectral decomposition of
symmetric tridiagonal matrices has gained considerable
interest due to the revision and successful implementation
of Dongarra and Sorensen. Since the singular value
decomposition of a bidiagonal matrix is closely related to
the spectral decomposition of the tridiagonal B^T B or
B B^T, but also of [...], there are several possibilities on
how to apply the divide-and-conquer algorithm on the
singular value decomposition. In this talk we present and
compare numerically some old and new approaches.

"Lattice QCD - As a Large Scale Scientific
Computation"

Clive F. Baillie et al., California Institute of Technology
Concurrent Computation Project, Pasadena.

Lattice QCD (Quantum Chromo-dynamics) is one of the most
computationally intensive large-scale scientific
computations. It can therefore be made to run efficiently on
any computer. As part of the Concurrent Supercomputing
Initiative at Caltech (CSIC), we have benchmarked Lattice
QCD on a large number of computers: Cray X-MP and Cray 2
(vector supercomputers); Caltech/JPL Mark III, Intel iPSC,
and Ncube hypercubes (MIMD distributed memory computers);
BBN Butterfly, Sequent Balance, and Alliant FX/8 (MIMD
shared memory computers); and TMC Connection Machine 2 and
AMT Distributed Array Processor (SIMD computers). Herein we
explain the computation required for Lattice QCD, describe
and contrast the different concurrent supercomputers used,
and present the results of the Lattice QCD benchmarks.

“On  the  Performance  of  Shared  Cache  for  Multipro¬ 
cessor  Organizations" 

G.M. Chaudhry and J.S. Bedi, Wayne State University, Michigan.

High-speed computers use cache memories to increase
the instruction execution rate by holding temporarily
those portions of the main memory which are
currently in use. A memory reference is a hit or a miss if
the referenced datum is present or absent in the cache,
respectively. After a miss the block containing the
desired datum is copied from the main memory to cache
memory. The hit ratio is the fraction of hits among all
references; the miss ratio is the fraction of misses. In order
to function effectively, cache memories must be carefully
designed and implemented.
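
The definitions of hit ratio and miss ratio can be made concrete with a small simulation. This is a minimal sketch under assumptions that are not in the paper: a fully associative cache with least-recently-used replacement, and a hypothetical list of byte addresses as the reference stream.

```python
from collections import OrderedDict

def hit_ratio(references, cache_blocks, block_size):
    """Return (hit ratio, miss ratio) for an assumed fully associative LRU cache."""
    cache = OrderedDict()  # block number -> None, least recently used first
    hits = 0
    for address in references:
        block = address // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None            # copy the missing block into the cache
    total = len(references)
    return hits / total, (total - hits) / total
```

With the reference stream `[0, 1, 2, 3, 0, 1, 8, 9, 0, 1]`, two cache blocks, and four addresses per block, only the first touch of each of the two blocks misses, so the two ratios sum to one with hits dominating.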

This paper studies the effects of shared cache on the
performance of tightly-coupled multiprocessor systems
in which main memory is also shared by all the processors.
In shared cache, each processor is able to access a
single cache, shared among all processors. The private
cache and multicache systems suffer from the data coherence
problem. Another problem of the private cache is that
certain shared system resources, such as operating system
routines, may be copied several times in the cache
memories when they are referenced by more than one
process. Shared cache allows dynamic allocation of total
cache space among the processors as compared to fixed
cache allocation per processor in private cache systems.

"The  Vectorization  and  Parallelization  of  ABAQUS". 

H. Bell, IBM, UK, and B. Karlsson, Hibbitt, Karlsson and Sorensen, Inc.,
Providence, Rhode Island.

ABAQUS is a finite element structural analysis package
marketed by Hibbitt, Karlsson and Sorensen, Inc.,
of Providence, Rhode Island. One of its strengths is the
analysis  of  nonlinear  problems.  ABAQUS  is  now  avail¬ 
able  in  a  version  that  has  been  extensively  vectorized  for 
the  IBM  3090  Vector  Facility.  In  addition,  elapsed  times 
have  been  reduced  by  using  3090  central  and  expanded 
storage  to  keep  the  data  arrays  in  storage  rather  than 
using  DASD  files. 

CPU speedups relative to scalar in excess of 3.0 have
been achieved, together with elapsed time reductions by
factors in excess of 6.0. In addition, a two-way parallel version has
been successfully demonstrated but not yet made commercially
available. Many parts of the code have been
vectorized, but the most significant CPU speedups have
come from the wavefront solver routine. This was extensively
restructured so as to cast the FORTRAN in a form
that would make maximum use of the IBM VF compound
multiply/add operation.

This paper describes the techniques used in the vectorization
work as well as the parallelization and the bringing
of files into storage. Performance results are also
presented. The work was done by Hibbitt, Karlsson and
Sorensen, Inc. with the cooperation of IBM, as part of
IBM’s worldwide program to assist vendors to enable
their packages for the IBM 3090 Vector Facility. Technical
guidance on the IBM Vector Facility hardware and
software was provided by IBM UK Ltd.’s Technical Support
function and also by the IBM Dallas Center.

“Use  of  Processor  Networks  for  Parallel  Polynomial 
Root  Computing" 

Ph. Berger and F. Ilo.xlui, Département Informatique, ENSEEIHT, France.

Solving polynomial equations is one of the oldest
problems in algebra. A large number of algorithms for
computing the zeros of polynomials have been developed.
However, the problem of improving them remains current
because users' requirements (in the fields of signal
processing, CAD, ...) are becoming more specific. High
precision is generally required. Furthermore, minimal
computation time is desirable (real-time computation
constraint).

The use of parallel computers may be an answer to the
last requirement. On the one hand, in many problems currently
treated, the data size is not very large, and
implementation on supercomputers, whose peak
performance is some hundreds of megaflops, seems
unnecessary. On the other hand, for many users it is very
difficult to have access to such computers. In this context,
the development of methods on a processor network, such as
a hypercube, may be interesting (good
price/peak-performance ratio).

The object of this paper is to present some experiments
in concurrently computing the roots of a high-degree
polynomial on a network of transputers. The
numerical algorithm determines simultaneously
all the roots of a one-variable polynomial whose
coefficients are real or complex numbers. A particular feature
of the method is that a process can be associated
with the computation of one or more roots, so that the
program is easily distributed on a set of loosely coupled
processors. The principal choices of implementation are
linked with the strategy of data communication, the complexity
of synchronization, and the type of network. To
this end, several versions are generated and compared.
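
The abstract does not name the numerical algorithm. As a hedged illustration of a method that finds all roots simultaneously and lets each root be updated by its own process, the sketch below uses the Durand-Kerner (Weierstrass) iteration; the choice of method, the starting points, and the function name are assumptions, not taken from the paper.

```python
def durand_kerner(coeffs, iterations=200, tol=1e-12):
    """Approximate all roots of a polynomial given as [a_n, ..., a_0]."""
    n = len(coeffs) - 1
    monic = [c / coeffs[0] for c in coeffs]

    def p(z):
        value = 0j
        for c in monic:
            value = value * z + c  # Horner evaluation
        return value

    # Distinct complex starting points (a customary, otherwise arbitrary choice).
    roots = [(0.4 + 0.9j) ** k for k in range(n)]
    for _ in range(iterations):
        new_roots = []
        # Each pass of this loop uses only the previous iterate, so the
        # updates for different i are independent and could run in parallel.
        for i, z in enumerate(roots):
            denom = 1 + 0j
            for j, w in enumerate(roots):
                if j != i:
                    denom *= z - w
            new_roots.append(z - p(z) / denom)
        shift = max(abs(a - b) for a, b in zip(new_roots, roots))
        roots = new_roots  # one exchange of all root estimates per sweep
        if shift < tol:
            break
    return roots
```

In a distributed version of this sketch, the only communication per sweep would be the exchange of the current root estimates, which is where the choices about communication strategy and synchronization mentioned above would come in.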

Accordingly, we shall present the solution algorithm
in its mathematical context; afterwards, the criteria
which have led us to elaborate different versions;
and finally, numerical results and performance related
to physical problems, problems drawn from the literature,
and randomly generated problems (polynomials of degree
≤ 64). We shall conclude on the efficiency of such a
configuration in the field of numerical solution of polynomial
equations.


“The  Spectrum  of  Sums  of  Projections  with  Applica¬ 
tion  to  Parallel  Algorithms  in  Grid  Refinement  and 
Domain  Decomposition” 

Petter Bjørstad and Jan Mandel, University of Bergen, Norway.

Knowledge  of  the  spectrum  of  sums  of  orthogonal 
projections  can  be  used  to  estimate  the  rate  of  conver¬ 
gence  of  iterative  methods  based  on  additive  formulations 
of  the  underlying  problem.  These  methods  are  interes¬ 
ting  for  use  in  a  parallel  processing  environment. 

We give a precise characterization in the case of two
projections, and show how the theory applies to both domain
decomposition methods and to algorithms for grid refinement.

Numerical results from an Alliant FX/8 system will
be presented.
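
A toy case shows why the spectrum of a sum of projections matters. The sketch below is an illustration under simplifying assumptions, not the authors' characterization: it takes two rank-one orthogonal projections in the plane, onto lines separated by an angle theta, and computes the eigenvalues of their sum, which work out to 1 - |cos theta| and 1 + |cos theta|.

```python
import math

def projector(angle):
    """Orthogonal projector onto the line spanned by (cos angle, sin angle)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * c, c * s], [c * s, s * s]]

def symmetric_eigenvalues_2x2(m):
    """Eigenvalues (smallest, largest) of a symmetric 2 x 2 matrix."""
    mean = 0.5 * (m[0][0] + m[1][1])
    radius = math.hypot(0.5 * (m[0][0] - m[1][1]), m[0][1])
    return mean - radius, mean + radius

def spectrum_of_sum(theta):
    """Eigenvalues of P1 + P2 for two lines separated by the angle theta."""
    p1, p2 = projector(0.0), projector(theta)
    total = [[p1[i][j] + p2[i][j] for j in range(2)] for i in range(2)]
    return symmetric_eigenvalues_2x2(total)
```

As the two lines approach each other, the smallest eigenvalue tends to zero and the ratio of largest to smallest eigenvalue grows, which is the kind of quantity that bounds the convergence rate of an additive iterative method.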

“Block Cholesky Factorization of Large Sparse Matrices
on Parallel Computers"

Jon  Braekhus,  Veritas  SESAM  Systems,  Norway. 

A  parallel  block  Cholesky  solver  is  developed  based 
on  the  secondary  storage  block  Cholesky  solver  of 
SESAM.  The  new  solver  is  also  a  secondary  storage  sol¬ 
ver,  but  is  prepared  to  take  advantage  of  large  memory. 
The  solver  consists  of  Cholesky  decomposition  and  back 
substitution. 

The solver first performs a symbolic factorization at
block level to determine which block operations will take
place. Then the dependencies between the block operations
are found. This makes it easy to change the sequence
in which the tasks are distributed and thereby to study the
load balance.
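
The two steps just described can be sketched as follows. This is a minimal illustration, not the SESAM solver: the block pattern is a hypothetical set of (row, column) positions in the lower triangle, the task names (`factor`, `solve`, `update`) are invented for the example, and updates to the same block are conservatively serialized.

```python
def symbolic_block_cholesky(n_blocks, nonzero_blocks):
    """Return (filled block pattern, {task: set of tasks it depends on})."""
    pattern = set(nonzero_blocks) | {(k, k) for k in range(n_blocks)}
    tasks = {}        # insertion order is a valid sequential order
    last_writer = {}  # block position -> most recent task that modified it

    def add_task(task, reads, writes):
        deps = {last_writer[b] for b in reads + [writes] if b in last_writer}
        tasks[task] = deps
        last_writer[writes] = task

    for k in range(n_blocks):
        add_task(("factor", k), [], (k, k))                # L_kk from A_kk
        rows = [i for i in range(k + 1, n_blocks) if (i, k) in pattern]
        for i in rows:
            add_task(("solve", i, k), [(k, k)], (i, k))    # L_ik from A_ik and L_kk
        for i in rows:
            for j in rows:
                if j <= i:
                    pattern.add((i, j))                    # fill-in, if new
                    add_task(("update", i, j, k), [(i, k), (j, k)], (i, j))
    return pattern, tasks

def schedule_levels(tasks):
    """Earliest level of each task; tasks on one level are mutually independent."""
    level = {}
    for task, deps in tasks.items():
        level[task] = 1 + max((level[d] for d in deps), default=0)
    return level
```

Counting the tasks per level for different block sizes or orderings gives a first, purely symbolic impression of the available parallelism and load balance before any numerical work is done.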

This new solver is tested on shared memory multiprocessors:
Cray X-MP, Alliant FX, and VAX. Differences
due to hardware and operating systems are studied.

The experiments on this solver include variation of
several parameters, the most important being:

•  Block  size  (hereby  setting  typical  vector  length,  work 
per  processor  and  file  storage  requirements) 

•  Sequence  of  block  operations  done  in  parallel  (this  will 
affect  memory  usage,  I/O  and  load  balancing) 

•  The  available  amount  of  primary  memory  compared  to 
bandwidth 

The results for the block approach are compared to the
results obtained for the non-blocked system by Alan George
et al.

The tests include several of the matrices from the
Harwell-Boeing "Sparse Matrix Test Collection" and two
real SESAM analyses. The performance is compared to
the sequential Cholesky solver of SESAM and a general
sequential sparse matrix solver.

SESAM is also prepared for parallel execution on
network connected computers on a higher level. This is
presented by Anders Hvidsten.

"Linear Programing on a Local Memory
Multiprocessor"

Richard Chamberlain, Intel Scientific Computers, Wilts, UK.

Minimizing a linear function subject to linear equalities
and inequalities can be a time consuming problem
when the number of variables is large. The standard
method to solve these problems is the simplex method.

This  paper  investigates  the  use  of  a  local  memory 
multiprocessor  to  solve  linear  programing  problems.  The 
distribution  of  the  data,  the  communications  require¬ 
ment,  the  need  for  duplicated  work  and  the  potential  of 
using  vector  processors  at  each  node  are  discussed.  Nu¬ 
merical  results  on  the  Intel  iPSC  and  the  vector  extended 
iPSC-VX  are  presented. 

"Parallel  Processing  Techniques  of  the  Euler  Equa¬ 
tions  on  the  IBM  3090  VF  Computer" 

S.M.  Chang,  IBM  Corporation,  et  al.,  US. 

The purpose of this presentation is to discuss parallel
processing techniques for fluid applications with the
use of the IBM 3090 computer with Vector Facility.
Through the introduction of a Clebsch transformation of
the velocity field, an equivalent set of the Euler equations
is obtained for solving steady, three-dimensional, transonic
flows. The resulting equations are solved by the finite
element method employing a block-structure
relaxation scheme.

The  solution  domain  is  subdivided  into  blocks,  and 
the  equations  are  solved  in  an  uncoupled  form  for  each 
block  with  appropriate  Dirichlet  and  Neumann-type 
boundary  conditions.  In  this  study,  IBM’s  Multitasking 
Facility  (MTF)  is  applied  to  distribute  block  processing 
across multiple processors. MTF is a VS FORTRAN facility
which provides the capability to exploit IBM’s
MVS/Extended Architecture (MVS/XA) operating system
to allow a single program to use more than one processor
simultaneously.

The results of two cases using the IBM 3090-200 will
be presented on the solution of Euler equations for the
wing-body problem. The first case consists of 24 blocks
with 8,700 nodes and the second case consists of 40
blocks with 120,000 nodes. Solutions will be reviewed in
terms of computational speed and the level of parallelism
achieved within the computations.

"Graphical  Interface  for  Large-Scale  Numerical 
Computation" 

Jeremy Cook, Chr. Michelsen Institute, Bergen, Norway.

It  is  well  known  that  large-scale  numerical  computa¬ 
tion  generates  indigestible  amounts  of  output  data.  De¬ 
velopment  of  such  applications  is  therefore  tedious  and 
time  consuming.  As  an  aid  to  rapid  development  of  this 
type  of  numerical  application,  a  system  is  now  available 
to  help  the  programer  build  a  user  interface  for  his  appli¬ 
cation  in  a  comparatively  short  time. 

A reasonable interface would consist of one or more
graphics windows which display the output data from the
back end processor. With a multiprocessor machine it
should be possible to display output from individual processors
or from the whole machine. There should also be
a control panel where the user is able to enter file names
for input and output data as well as controlling the flow
of data and selecting which results to display by easy-to-use
menu options.

It would typically require 5-25 man-days, depending
on the level of complexity, to develop such a user interface
for a parallel application with a graphics workstation
at the front end. The goal for the system described is to
enable the user to build a basic interface within a windowing
environment in the course of a working day. With
such an interactive graphics interface the numerical application
will be developed much faster than normally attainable.
The increase in productivity is difficult to
measure but is estimated to be a factor of about 5 or greater.

The pilot system, initially aimed at users with SUN
workstation/Hypercube combinations, is implemented
for X-windows and the SunView windowing environment.
The user interface is built by specifying the layout
and function of each window in a configuration file. It is
also necessary for the user to write a simple graphics function
which will display the data in the best form for interpretation.

This  work  is  part  of  a  CMI  project  to  build  a  user 
friendly  environment  for  large-scale  numerical  computa¬ 
tion. 

"Fully  Vectorizable  Preconditionings  for  Parallel 
Local  Grid  Refinement" 

J.C. Diaz et al., The University of Tulsa, Oklahoma.

Many  time-dependent  problems  involve  both  general 
phenomena  as  well  as  significantly  localized  phenomena. 
These  are  often  critical  to  the  overall  behavior  of  the 
physical  processes  and  are  usually  dynamic  in  nature.  For 
large-scale  physical  modeling,  it  is  frequently  impossible 
to  use  a  uniform  grid,  in  the  numerical  procedure,  which 
is  sufficiently  fine  to  resolve  the  local  phenomena  without 
yielding  an  extremely  large  number  of  unknowns. 

Use of dynamic grid refinement has been shown to
be a practical method to approach these large problems.
A coarse grid is placed over the domain and finer grids
are used in those subregions where localized phenomena
appear. In general, several levels of refinement might be
necessary to achieve a given minimization of the error in
the solution. Tree-like data structures permitting the efficient
control of the placement and/or removal of the fine
grids have been discussed by several authors.

In particular, a method permitting the placement or
removal of overlapping grids has been shown to be effective
when dealing with systems of hyperbolic conservation
laws. Because of the nature of the data structure,
grids at the same level in the tree are completely independent
from each other and can be solved in parallel.
This inherent coarse grain parallelism makes this method
even more attractive for utilization in the modeling of
large-scale problems.

Our main interest is in transport-dominated diffusion
problems which, in general, require the use of implicit-in-time
discretization schemes. As a consequence, large,
sparse, nonsymmetric systems of linear equations have to
be solved in order to advance the solution for each independent
grid.

To further exploit the capabilities of today’s parallel
vector-computers, it is imperative to have a scheme for
solving the sparse nonsymmetric linear systems which can
be fully vectorized. In this way, parallelism is achieved
through the distribution of the grids among the parallel
processes available, and also by making use of the vector
operations for each parallel process. We use a conjugate-gradient-type
method with preconditioning to solve the
sparse nonsymmetric linear system arising from the discretization
of the physical model.

The main obstacles for complete vectorization have
been the preconditioning calculation and the application
step within the iteration. For the matrices obtained using
the above point discretization operators, the existing
preconditioners usually require a block-recursive procedure
which prevents vectorization.

Preconditionings based on nested incomplete factorization
and approximate inverses have been proposed
by some authors. At the innermost level of incomplete
factorization the approximate inverse of a tridiagonal
matrix is calculated. Preconditioning schemes for
nonsymmetric problems using Frobenius norm minimization
for the determination of the approximate inverse
are discussed herein. We derive a formulation of this
preconditioner which can be fully vectorized.
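
The idea of Frobenius norm minimization can be shown in its simplest form. The sketch below is a deliberately simplified assumption, not the authors' preconditioner: it restricts the approximate inverse M to a diagonal pattern (the abstract concerns a tridiagonal pattern), so that minimizing the Frobenius norm of AM - I decouples into one scalar least-squares problem per column, with solution m_j = a_jj divided by the squared 2-norm of column j of A.

```python
def diagonal_approximate_inverse(A):
    """Diagonal M (returned as a list) minimizing the Frobenius norm of A M - I."""
    n = len(A)
    m = []
    for j in range(n):
        column = [A[i][j] for i in range(n)]
        # Scalar least squares: minimize || column * m_j - e_j ||_2.
        m.append(column[j] / sum(c * c for c in column))
    return m

def frobenius_residual(A, m):
    """Frobenius norm of A * diag(m) - I."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            entry = A[i][j] * m[j] - (1.0 if i == j else 0.0)
            total += entry * entry
    return total ** 0.5
```

Applying such an M to a vector is a single elementwise product, with no recurrence between components, which is the property that makes approximate-inverse preconditioners attractive on vector hardware.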

The application of this preconditioner requires only
matrix-vector and vector-vector products. It can be vectorized
in full if appropriate data structures are used to
represent the sparse matrices.

Numerical experiments indicate that, for a class of
nonsymmetric problems, application is up to 50% faster
than existing methods, such as ILU.

Calculation of the preconditioning is somewhat more
expensive, but the faster application and the reduced
number of iterations necessary to minimize the error
more than compensate for this drawback. Some new theoretical
results concerning the properties of the approximate
tridiagonal inverse have been obtained and will be
presented.

Numerical  results  for  some  sample  transport-domi¬ 
nated  diffusion  problems  to  illustrate  the  performance  of 
the  overall  parallel  method  using  vector-parallel  architec¬ 
tures,  will  also  be  presented. 

"Finite  Element  Optimisation  in  ADA  Using  Auto¬ 
matic  Differentiation" 

L.C.W. Dixon and M. Mohseninia, The Hatfield Polytechnic, UK.

In an earlier paper (Dixon and Mohseninia, 1987), the
authors described an implementation of Rall’s (1981)
automatic differentiation approach in ADA using the
concepts of new data types and overloaded operators. In
that paper, the approach was combined with the Truncated
Newton optimization algorithm (Dembo and Steihaug,
1985), and tested on a number of simple test
problems.
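
The combination of a new data type with overloaded operators can be sketched in a few lines. The example below is an assumption-laden illustration in Python rather than ADA, and it is not the authors' code: the `Dual` type carries a value together with one directional derivative, supports only addition, subtraction, and multiplication, and `gradient` simply runs the function once per variable.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen variable."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule carried along with the value.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def gradient(f, point):
    """Gradient of f at point, one forward sweep per variable."""
    grad = []
    for i in range(len(point)):
        args = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(point)]
        grad.append(f(*args).deriv)
    return grad

def rosenbrock(x, y):
    return (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x)
```

The test function is written exactly as it would be for ordinary numbers; the derivative code is never written by hand, which is the point of the approach.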

The automatic differentiation approach has now
been  extended  to  generate  sparse  Hessian  matrices.  The 
finite  element  optimization  approach  to  nonlinear  partial 
differential  equations  has  been  used  to  generate  nonli¬ 
near  optimization  problems  with  large  dimension,  and  re¬ 
sults  of  applying  the  algorithm  to  such  problems  will  be 
presented. 

“Using  Symmetries  and  Antisymmetries  to  Analyze 
a  Parallel  Multigrid  Algorithm:  The  Elliptic  Bound¬ 
ary  Value  Problem  Case" 

Craig C. Douglas and Barry F. Smith, US.

We exploit symmetry and antisymmetry properties of
a class of elliptic partial differential equations to prove
when a particular parallel multilevel algorithm is a direct
method rather than the usual iterative method. No
smoothing is required for this result. Examples are
presented, including variable coefficient ones. A connection
between our algorithm and domain decomposition is
established, even though this algorithm is more general
and different. We also analyze the parallel algorithm
when it is iterative. We show how to increase processor
utilization in this case. We analyze Hackbusch’s so-called
"robust multigrid" algorithm for some model problems
and show that our parallel algorithm uses much less
computer time with, at most, the same amount of storage.

"Parallel  Implementation  of  the  Boundary  Element 
Method" 

J.B. Drake et al., Oak Ridge Laboratory, Tennessee.

In this paper the implementation of the boundary element
method (BEM) on a hypercube is considered. A
program solving the three-dimensional Laplace equation
for the electric field of an electroplating cell is described.
The BEM is specifically adapted to this application,
which requires the solution of nonlinear boundary conditions.
As a consequence, the matrices associated with
the prescribed Dirichlet values and the prescribed Neumann
values are formed separately and stored. A
step of the algorithm is thus required to combine and rearrange
the system of equations into the standard form
AX = B.

The matrix A is dense and nonsymmetric, and recent
advances in the art of solving dense linear systems
on hypercubes are taken into account in the development
of the algorithm. Estimates of the arithmetic
complexity at each step of the algorithm and a model for
the communication costs are used to study the parallel
performance of the BEM, which is particularly well suited for
parallel solution and can be implemented efficiently on
a hypercube.

"Hypercube Implementation of a Linear Systems
Solver Using Tensor Equivalents"

Lisette de Pillis et al., Chr. Michelsen Institute, Norway.

A new stationary iterative method for the solution of
special linear systems developed by Dr. John de Pillis is
implemented on an Intel iPSC-VX Hypercube. The
method finds the solution vector x for the invertible n x n
linear system $Ax = (I - B)x = f$, where A has real spectrum.

The solution method converges quickly because,
through the use of tensor products, an equivalent system
with a better spectrum is generated. The Jacobi iteration
matrix B is replaced by an equivalent iteration matrix
with a smaller spectral radius. A good approximation to
the spectral boundaries of A is a requirement for this algorithm.
A method for finding these spectral parameters
in parallel is discussed. The parallel algorithm for
finding the vectors partitions A row-wise among all the
processors in order to keep memory load to a minimum
and to avoid duplicate computations. The algorithm has
been fine-tuned in order to take full advantage of the vector
hardware on the hypercube and to further reduce run
time. Example problems and timings will be presented.
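
The role of the spectral radius can be seen in the plain stationary iteration that the method starts from. The sketch below shows only this baseline, x <- Bx + f for a system written as (I - B)x = f; the tensor-product acceleration itself is not described in the abstract and is not attempted here. The function name and stopping rule are assumptions.

```python
def stationary_iteration(B, f, tol=1e-12, max_steps=10_000):
    """Solve (I - B) x = f by iterating x <- B x + f; returns (x, steps)."""
    n = len(f)
    x = [0.0] * n
    for step in range(1, max_steps + 1):
        # Each component of the new iterate depends only on the old iterate,
        # so rows of B can be distributed over processors.
        x_new = [sum(B[i][j] * x[j] for j in range(n)) + f[i] for i in range(n)]
        change = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if change < tol:
            return x, step
    raise RuntimeError("no convergence; the spectral radius of B may be >= 1")
```

Halving every entry of B halves its spectral radius, and the iteration then needs noticeably fewer steps to reach the same tolerance, which is the effect an equivalent system with a better spectrum is meant to deliver.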

"Prospectus  for  the  Development  of  a  Linear  Alge¬ 
bra  Library  for  High-Performance  Computers" 

James Demmel et al., Courant Institute, New York.

We propose to design and implement a transportable
linear algebra library in FORTRAN 77 for efficient use
on a wide range of high-performance computers. The
production of such a library for the most commonly encountered
problems of linear algebra would have several
benefits:

1. It would facilitate the development of scientific
codes on high-performance computers. This area was recently
identified by the Computational Science and Engineering
Initiative of the National Science Foundation as
in serious need of development. The large and growing
variety of machine architectures puts a heavy burden on
the scientific programer to use each machine efficiently,
since speed is the major reason to use high-performance
computers. The availability of a highly efficient library for
standard linear algebra problems on each major machine
would free the programer to work on more interesting
parts of the code.

2.  It  would  increase  the  portability  of  scientific  codes 
between  different  computing  environments.  Programs 
written  largely  in  terms  of  calls  to  a  standard  library  would 
require  less  work  to  tune  to  the  new  computer  architec¬ 
ture,  since  the  library  routines  would  already  be  tuned. 

3. It would improve the utilization of a scarce resource.
By making efficient, state-of-the-art codes available
even to beginning users, more efficient use could be
made of expensive supercomputer cycles.

4.  It  would  provide  tools  to  aid  performance  evalu¬ 
ation  of  computers.  A  national  study  has  identified  the 
evaluation  of  supercomputer  performance  as  an  area  in 
need  of  development  and  standardization. 


To realize these benefits, the new library must satisfy
several criteria. First, the library must be highly efficient,
or at least "tunable" to high efficiency, on each
machine. Otherwise it will not be useful for benchmarking
nor will it improve utilization, and users will continue
to write their own (not necessarily better) algorithms.
Second, the user interface must be uniform across machines.
Otherwise much of the convenience of portability
would be lost. Third, the programs must be widely
available.

The success of the NETLIB facility has demonstrated
how useful and important it is for these codes to
be available easily, and preferably on line. We propose
to distribute the new library in a similar way, for no cost
or a nominal cost only. In addition, the programs must be
well documented, in the style of the LINPACK manual.
To achieve these goals, we propose a linear algebra library,
based on the successful EISPACK and LINPACK
libraries, with the following further developments:

•  Integration of the two sets of algorithms into a unified
library, with a systematic design

•  Incorporation of recent algorithmic improvements

•  Restructuring of the algorithms to make as much use
as possible of the Basic Linear Algebra Subprograms
(BLAS). Use of the BLAS is the basis of our approach
to achieving efficiency, and is discussed at greater
length in section 2.2.

In short, a library would become a central part of the
infrastructure of a growing high-performance scientific
programing environment, much as conventional libraries
for serial machines are essential to conventional
scientific computing.

"Functional  Languages  for  Scientific  Software" 

Lennart Edblom, Institute of Information Processing, University of
Umeå, Sweden.

During the past few years a number of different parallel
computers have appeared. They all try to exploit
parallelism one way or another. The architectures of
these computers are however widely differing, and there
is no common language or language features for programing
parallel computers.

We examine some of the problems associated with
programing current parallel computer architectures. Regardless
of whether you are using a multiprocessor with
distributed memory or a vector computer, you must be
aware of the overall organization, and also many of the
particular details of the computer system to use it efficiently.
We also examine some aspects of the languages
currently used for programing parallel computers, and
find that they are quite inadequate to express parallelism
in a machine-independent but problem-oriented style.

Our conclusion is that these problems are best solved
by introducing radically new architectures and languages.
We propose data flow architectures and functional languages
as a possible solution. One property of functional
languages is that there is no inherent sequentiality in the
language. Concurrency is implicit in a functional program,
and both regular and irregular parallelism, both
operator-level and process-level parallelism are equally
well represented. Similarly, data flow architectures have
no concept of control flow. An instruction is ready for execution
when its operands have arrived; thus highly concurrent
computation is possible.

We have chosen scientific software as our application
area, in particular linear algebra. We show how some
well-known linear algebra problems may be coded in a
functional language. The examples include matrix multiplication,
Gaussian elimination and Cholesky factorization.
Functional languages are found to have several
advantages, e.g., that the functions are formulated on a
high level and are amenable to program transformation.
Furthermore, a compiler can easily extract a suitable grain
of parallelism.
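
The flavor of such a formulation can be suggested with matrix multiplication. The sketch below is written in a functional style in Python, which is an assumption made for illustration; the abstract does not say which functional language was used. Nothing is assigned to after creation, so every entry of the result is an independent expression.

```python
from functools import reduce

def dot(u, v):
    """Inner product as a fold over paired elements."""
    return reduce(lambda acc, pair: acc + pair[0] * pair[1], zip(u, v), 0)

def transpose(m):
    """Columns of m as a list of lists."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Matrix product defined entry by entry, with no mutable state."""
    cols = transpose(b)
    return [[dot(row, col) for col in cols] for row in a]
```

Because `matmul` says only what each entry is and not the order in which entries are computed, an implementation is free to evaluate rows, entries, or even the partial sums inside `dot` concurrently.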

We will continue to investigate how a functional language
for the development and coding of scientific software
should be designed. There are several remaining
problems, e.g., the efficient handling of arrays, but the
potential advantages are more than enough to motivate
continued research.

"Coherent Parallel C"

Edward W. Felten et al., California Institute of Technology,
Pasadena.

Coherent Parallel C (CPC) is an extension of C for
parallelism. The extensions are not simply parallel for
loops; instead, a data parallel programing model is
adopted. This means that one has an entire process for
each data object. An example of an "object" is one mesh
point in a finite element solver. How the processes are
actually distributed on a parallel machine is transparent:
the user is to imagine that an entire processor in a
distributed-memory environment is dedicated to each
process. This simplifies programing tremendously: complex
if statements associated with domain boundaries disappear;
problems which do not exactly match the
machine size and irregular boundaries are all handled
transparently.

The  usual  communication  calls  are  not  seen  at  all  at 
the  user  level.  Variables  of  other  processes  (which  may 
or  may  not  be  on  another  processor)  are  merely  accessed 
(global  memory).  The  first  pass  of  the  CPC  compiler 
schedules  the  necessary  communications  in  an  efficient, 
loosely  synchronous  manner.  Processes  in  CPC  are  insu¬ 
lated  from  one  another  and  interact  in  a  deterministic 
manner.  This  allows  tractable  debugging.  Standard  C 
I/O  is  provided,  with  simple  extensions  for  parallelism. 

Naturally,  some  performance  must  be  sacrificed  for 
programing  ease.  Linear  and  near-linear  speedups  still 
occur,  although  with  a  lower  level  of  absolute  perfor¬ 
mance.  Results  and  performance  models  will  be  given. 
CPC is not specific to distributed memory machines. At
the user level, one sees only processes and knows nothing
of domain boundaries, processor numbers, etc.
Implementation of this language on other architectures is
natural; there seem to be no fundamental problems with
CPC on shared-memory parallel computers or
fine-grained SIMD computers.

"Chess  on  a  Hypercube" 

Edward W. Felten et al., California Institute of Technology,
Pasadena.

We have implemented computer chess on an
NCUBE Hypercube. The program follows the strategy
of currently successful sequential chess programs:
searching an alpha-beta pruned game tree, iterative
deepening, transposition and history tables, specialized
endgame evaluators, and so on. The search tree is
decomposed onto the hypercube using a recursive version of
the principal-variation-splitting algorithm. Roughly
speaking, subtrees are searched by teams of processors in
a self-scheduled manner. Search times for related
subtrees vary widely (up to a factor of 100), so dynamic
reconfiguration of processors is necessary to concentrate on
"hot spots" in the tree.

An interesting feature is the global transposition
table. For this data structure the hypercube is used as a
shared-memory machine. Multiple writes to the same
location are resolved using a priority system which decides
which entry is of more value to the program.
Implementation of the transposition table as "smart" shared
memory is crucial to the performance of the program.
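
The write-conflict rule described above can be sketched as follows. This is only an illustration in Python under stated assumptions: the abstract does not say how the priority is computed, so the sketch assumes that an entry from a deeper search is worth more. The names `store`, `num_slots`, and the `depth` and `score` fields are hypothetical.

```python
def store(table, key, entry, num_slots=1024):
    """Write an entry into a fixed-size table, keeping the higher-priority one.

    Priority is ASSUMED to be search depth; the real program may use
    a different measure of an entry's value.
    """
    slot = key % num_slots  # several positions may map to the same slot
    current = table.get(slot)
    if current is None or entry["depth"] >= current["depth"]:
        table[slot] = dict(entry, key=key)
        return True
    return False  # the existing entry is judged more valuable


table = {}
first = store(table, 5, {"depth": 3, "score": 10})
second = store(table, 5 + 1024, {"depth": 1, "score": -4})  # same slot, shallower
third = store(table, 5 + 2048, {"depth": 6, "score": 7})    # same slot, deeper
```

On a distributed-memory machine the slot index would additionally select the processor that owns the slot; that routing is omitted here.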

The  program  has  played  in  several  tournaments,  fac¬ 
ing  both  computers  and  people.  Most  recently  it  scored 
2-2  in  the  North  American  Computer  Chess  Champion¬ 
ship. 
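
For readers unfamiliar with the sequential search that the program above builds on, the following is a minimal Python sketch of alpha-beta pruning on a toy game tree. It is not the authors' code and does not show principal-variation splitting or any parallelism. The tree encoding (nested lists with numeric leaves) and the name `alphabeta` are assumptions made for this example.

```python
def alphabeta(node, alpha, beta, maximizing):
    """Minimax value of a toy game tree with alpha-beta cutoffs.

    A node is either a number (leaf evaluation) or a list of child nodes.
    """
    if isinstance(node, (int, float)):
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the opponent will never allow this line
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximizer already has a better alternative
    return value


# Hypothetical depth-2 tree: the root maximizes, its children minimize.
tree = [[3, 5], [6, 9], [1, 2]]
best = alphabeta(tree, float("-inf"), float("inf"), True)
```

In this example the third subtree is cut off after its first leaf, which shows why search times for sibling subtrees can differ widely.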

"Local  Convergence  of  Nonlinear  Multisplitting 
Methods" 

Dr. Frommer, Universität Karlsruhe, West Germany.

Multisplitting methods for the solution of a system of
linear equations Ax = b are based on several splittings of
the coefficient matrix A. In a parallel-computing
environment each processor performs iterations corresponding
to one of the splittings, and the final iterate is obtained by
combining the individual iterates in an appropriate
manner.
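
The linear multisplitting iteration just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method analysed in the paper: it uses two splittings A = M_k - N_k, with M_1 the lower triangle and M_2 the upper triangle of A, and combines the two iterates with equal weights of one half. The test matrix and all function names are hypothetical.

```python
def residual(A, b, x):
    n = len(b)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def solve_lower(A, r):
    """Solve M y = r, where M is the lower triangle of A (diagonal included)."""
    y = [0.0] * len(r)
    for i in range(len(r)):
        s = sum(A[i][j] * y[j] for j in range(i))
        y[i] = (r[i] - s) / A[i][i]
    return y


def solve_upper(A, r):
    """Solve M y = r, where M is the upper triangle of A (diagonal included)."""
    y = [0.0] * len(r)
    for i in reversed(range(len(r))):
        s = sum(A[i][j] * y[j] for j in range(i + 1, len(r)))
        y[i] = (r[i] - s) / A[i][i]
    return y


def multisplitting_step(A, b, x, weight=0.5):
    """One step x_new = sum_k E_k M_k^{-1} (N_k x + b) with E_1 = E_2 = I/2.

    Uses the identity M^{-1}(N x + b) = x + M^{-1}(b - A x).
    """
    r = residual(A, b, x)
    d1 = solve_lower(A, r)  # work for splitting 1 (one processor)
    d2 = solve_upper(A, r)  # work for splitting 2 (another processor)
    return [xi + weight * u + (1.0 - weight) * v for xi, u, v in zip(x, d1, d2)]


# Hypothetical diagonally dominant system with exact solution (1, 1, 1).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]
for _ in range(50):
    x = multisplitting_step(A, b, x)
```

The two triangular solves are independent of each other, which is where the parallelism comes from; only the weighted combination requires communication.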

In a systematic way we now extend this idea to solving
a nonlinear system of equations F(x) = 0. These nonlinear
multisplittings are based on several nonlinear splittings of
the function F, and the corresponding calculations can
again be performed in parallel. Each processor would
now have to calculate the exact solution of an individual
nonlinear system belonging to "its" nonlinear
multisplitting. Although these individual systems are usually much
less involved than the original system, the exact solutions
will in general not be available.

Therefore, we consider important variants where the
exact solutions of the individual systems are
approximated by some standard method such as Newton's
method.

We present a local convergence analysis of the
nonlinear multisplitting methods and their variants. In
particular we will show that these methods converge
linearly and that the speed of convergence is
determined by an induced linear multisplitting of the
Jacobian of F. It will also turn out that the speed of
convergence is not affected if the individual systems are
solved exactly or only approximately via Newton's
method. We include some numerical experiments to
illustrate our results.

"A  Parallel  Computer  Implementation  in  Finite 
Element  Methods" 

Robert E. Fulton et al., Georgia Institute of Technology, Atlanta.

The paper reports on the development and
implementation of parallel processing software for finite
element solutions. A parallel FEM equation solver has
been developed and tested for several static and dynamic
analysis demonstration problems. It has also been
incorporated in the production finite element system FENRIS,
with applications to crash dynamic test problems.
Parallel processing methods for transient analysis have been
studied using implicit and explicit numerical integration
schemes. Parallel software has been implemented on
several multiprocessor machines, including shared memory
and local memory computer architectures. The results
show that a parallel processing approach can
significantly reduce execution time for large-scale finite element
problems.

"Numerical  Sea  Modelling  Using  Parallel  Vector 
Processing" 

G. Furnes et al., Bergen Scientific Centre, IBM, and Institute of Marine
Research, Norway.

Three-dimensional hydrodynamic equations for
tides and wind-induced flow in a sea region are
solved numerically using two different computational
techniques: first on a single-processor computer
and then on a parallel computer with a number of
processors.

The model equations are solved explicitly on a finite
difference staggered grid in the horizontal space domain.
In the vertical domain both expansion in terms of
eigenfunctions and finite difference box schemes are
considered. In the time domain we used forward time
stepping. The parallel processing scheme described in
this paper consists roughly of dividing the sea area into a
number of subdomains determined by the number of
processors available.

Experiments  with  different  horizontal  resolutions  in 
the  "functional  model"  and  the  "grid  box  model"  are  per¬ 
formed,  and  the  relative  parallel  efficiency  will  be  dis¬ 
cussed. 


"The  Evolution  of  Parallel  Processing  at  CRAY 
Research" 

Mark Furtney, CRAY Research, Mendota Heights, Minnesota.

In 1983, CRAY Research introduced the X-MP/2,
and the face of supercomputing has never been the same
since. Since that time, 4-CPU and 8-CPU machines have
been introduced. This talk briefly describes the hardware
organization which promotes these first commercially
successful multiprocessor supercomputers but
concentrates on the evolution of the support software which has
grown to deliver hardware performance to users. The
first effort (now termed Macrotasking) provided a library
of FORTRAN-callable routines which implemented a set
of synchronization primitives with which users could
create and control multiple tasks within a single program.
This library soon became a de facto industry standard, but
it did not fulfill all the needs of the supercomputing user
community.

Microtasking evolved from Macrotasking, and its
very low overhead synchronization allowed new levels of
parallelism to be profitably exploited. The design of
Microtasking will be covered in some detail, including a
discussion of why it works so well for both batch- and
dedicated-mode computing, and why it has gotten so
popular. The next step in the evolution of parallel
processing software (termed Autotasking) will then be
described, again including a discussion of software design
issues, considerations, tradeoffs, and decisions.

"The  MMX  Parallel  Operating  System  and  its 
Processor" 

Eran  Gabber,  Tel  Aviv  University,  Israel. 

MMX  (Multiprocessor  Multitasking  executive)  is  a 
small  yet  powerful  operating  system  for  shared  memory 
multiprocessors.  The  MMX  parallel  processor  is  a  small 
shared  bus  multiprocessor  assembled  from  several  Na¬ 
tional  Semiconductor  processor  boards.  Together, 
MMX  and  its  parallel  processor  provide  a  flexible  and 
powerful  testbed  for  parallel  software  development. 

This  paper  describes  MMX  structure,  services  and 
performance.  Parallel  programing  methods  using  MMX 
are  sketched  along  with  timing  and  speedup  measure¬ 
ments  of  several  parallel  programs.  The  paper  concludes 
with  a  brief  description  of  future  research  directions. 

"The  Arithmetic  Mean  Method  for  Solving  Linear 
Dissipative  Systems  on  a  Vector  Computer" 

Ilio Galligani and Valeria Ruggiero, Universities of Bologna and
Ferrara, Italy.

This paper is concerned with the implementation on
a parallel computer with a few vector processors of the
arithmetic mean method for solving dissipative systems of
the form

$$\frac{dv(t)}{dt} + A\,v(t) = b, \quad t > 0, \qquad v(0) = g,$$

where the matrix A + A^T is symmetric positive definite.
For example, such systems arise in solving the
initial-boundary value problem for the diffusion-convection
equation on a rectangular domain by the method of lines.

In this case, A is large and sparse with a nonrandom
sparsity pattern. In this note we make the assumption that
the matrix A can be expressed as A = A_1 + A_2, where A_k
(or P A_k P^T, with P a permutation matrix) is a matrix of
simple structure (for example, triangular or tridiagonal).
Then, it is possible to solve the system with the arithmetic
mean method. The consistency and the stability of this
method have been analysed.

The method is well suited to parallel
implementation on a multiprocessor system that can execute
different tasks concurrently on a few vector processors with
shared central memory, such as the CRAY X-MP/48. A
high-level parallelism among independent tasks is offered
by the Cray multitasking. An implementation of the method
on the CRAY X-MP/48, using microtasking directives,
has been developed for the case when A is a block-tridiagonal
matrix and each square block submatrix on the diagonal of A is
a tridiagonal matrix. A detailed description of this
implementation is given and the results of some
computational experiments carried out on test block-tridiagonal
matrices are reported.

"Parallelizing  an  Efficient  Partial  Pivoting  Algorithm" 

John R. Gilbert, Cornell University, New York.

A sparse matrix can be factored by Gaussian
elimination with partial pivoting in time proportional to the
number of nonzero arithmetic operations, using an
algorithm of Gilbert and Peierls. A sequential
implementation of that algorithm is quite efficient in practice.

We obtain a shared-memory parallel version of the
algorithm by using two ideas: elimination trees are used
to identify parts of the factorization that can be
performed independently in parallel, and the
graph-theoretic structure prediction step in the original algorithm
is modified to allow pipelining of consecutive columns.
We present results from an experimental implementation
on an Alliant FX/8 multiprocessor.

"Parallel  Neural  Network  Simulation  Using  Sparse 
Matrix  Techniques" 

Jeremy  Cook  et  al.,  Chr.  Michelsen  Institute,  Norway. 

Neural computing is an emerging concept in artificial
intelligence. This new way of programing attempts to
simulate the way in which the brain processes
information. The massive parallelism of the brain makes human
perception much faster than pattern recognition
algorithms on conventional computers. Neural networks are
a natural framework in which to implement applications
such as data bases, character recognition, speech
recognition, and syntax checking.

Real neural computers capable of significant
computation do not exist yet, but today's conventional parallel
computers are a good testbed on which to simulate
neural computers and experiment with neural algorithms.
This paper reports on the use of a message-passing
multiprocessor to simulate a neural computer.

Simulating a neural network efficiently on a
multiprocessor is related to parallel sparse matrix
computation. One basic iteration of the network is essentially a
matrix-vector multiplication, with addition replaced by
evaluation of a nonlinear threshold function that models
the response of a neuron. The network is sparse; most
pairs of neurons are not connected at any given time.
Communication and load-balancing issues are similar to
those in numerical sparse matrix computation. However,
a learning network must also periodically modify its
connection weights (that is, the matrix values), or even its
connectivity (that is, the matrix structure).
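
One network iteration of the kind described above can be sketched as a sparse matrix-vector product followed by a threshold. The Python below is an illustration only and is not taken from the paper: it assumes one common reading in which the threshold is applied to each neuron's weighted input sum, and it assumes a simple step function. The storage layout (a dictionary of adjacency lists) and the names `threshold` and `step` are hypothetical.

```python
def threshold(s):
    """ASSUMED step response: the neuron fires when its net input is positive."""
    return 1.0 if s > 0.0 else 0.0


def step(weights, state):
    """One network iteration over a sparse weight matrix.

    weights maps neuron i to a list of (j, w_ij) pairs for the nonzero
    connections only, so the work is proportional to the number of
    connections rather than to the square of the number of neurons.
    """
    return [
        threshold(sum(w * state[j] for j, w in weights.get(i, [])))
        for i in range(len(state))
    ]


# Hypothetical three-neuron network.
weights = {0: [(1, 1.0)], 1: [(0, 0.5), (2, -1.0)], 2: [(2, 1.0)]}
state = [1.0, 0.0, 1.0]
new_state = step(weights, state)
```

Changing a weight corresponds to editing a pair in place, while adding or removing a connection changes the list structure, which mirrors the distinction between matrix values and matrix structure made in the abstract.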

We shall describe experiments with a neural network
simulation on the Intel iPSC hypercube machine. The
implementation is based on a combination of standard
sparse matrix technology and dynamic restructuring of
the network during the course of the computation.

"Image Analysis Algorithms on Supercomputers"

Fred Godtliebsen et al., The Norwegian Institute of Technology, Norway.

Iterated Conditional Modes and Simulated
Annealing are two standard statistical techniques for image
improvement in image analysis. They may, however, be very
time consuming.

The algorithms are applied in medical diagnosis.

This paper gives implementation and examples
tested on vector and parallel computers.

The algorithms are developed on a CRAY X-MP/28.
We also plan to run them on VAX 8600, Apollo DN580 and
Alliant FX/6. Speedup factors and execution times are
given and discussed.

"Optimal  Power  Scheduling  of  a  Large  Electric  Net¬ 
work  Via  Nonlinear  Programing  on  the  CRAY  X- 
MP/48" 

L. Grandinetti and D. Conforti, Università della Calabria, Italy.

A problem of great practical interest, related to the
production, transmission and distribution of electric
energy in a network, is taken into consideration and the
optimal management policy for the system is formulated as
a nonlinear mathematical program. This program is
characterized by large-scale dimension and highly
nonlinear constraints; the need for a "real-time" numerical
solution is an additional distinctive aspect of it.

A number of mathematically sound nonlinear
optimization algorithms, which use gradient information of
the objective and constraint functions, are selected with
special attention to those particularly suitable for a
proper matching to the resources offered by a vector
supercomputer.

Analysis of numerical results suggests that the
computation performed on a CRAY X-MP/48 provides a
satisfactorily efficient solution of the proposed problem, in
spite of its severe computational characteristics.

"An Extension of NAG/SERC Finite Element Library
for Message Passing Multi-Processor Systems"

C. Greenough and C.J. Hi/ni, Rutherford Appleton Laboratory, UK.

During a time when concurrent computing hardware
is developing quickly and software costs are escalating it
is important to develop programing methodologies that
use a significant amount of existing serial software on
these systems.

In  this  paper  we  present  a  number  of  extensions  to 
the  serial  version  of  the  NAG/SERC  Finite  Element  Li¬ 
brary  which  will  enable  the  users  of  the  serial  Library  to 
make  use  of  the  many  emerging  message  passing  concur¬ 
rent  systems. 

Under the basic requirement that the use of existing
serial user programs should be maximized and that the
general philosophy of the programs should not change, a
number of extensions have been developed to aid users in
exploiting multiprocessor systems.

The paper will address two areas: the programing
philosophy and the implementation of the finite element
method. A discussion of the method used for domain
decomposition, element and system matrix assembly and
linear algebra in relation to processor usage will be given.

These extensions have been designed for a general
multi-processor system and have initially been implemented on
a hypercube architecture; some results using the
extended system will be given. Some indication of future
work will be given, particularly in the use of transputer
systems.

"An  Array  Processor  Architecture  for  Neural  Net¬ 
works  Analysis" 

Anne Guérin et al., Institut National Polytechnique de Grenoble,
France.

In the last few years, neural networks analysis has
developed astonishingly. According to this approach, the
study of existing functions in the nervous system demands
some powerful simulation tools. It is obvious that some
processes (perception, for example) that are usually
computed effectively in the nervous system are handled
only with difficulty by present-day processors.

The organization of the nervous system is totally
opposite to the computing principles of the classical
von Neumann processing architecture. At the very first level,
the nervous system is composed of highly interconnected
neural networks. So the two main characteristics are, on
the one hand, a great number of simple cells (neurons) and, on
the other hand, a great degree of interconnection
between these cells. This structure therefore requires more
power for communication control than for computation in each
cell. Taking inspiration from the nervous system
organization, the so-called "neuromimetic" architectures
provide an optimal combination of power and speed. With
the improvement of VLSI technology, it is quite easy to
implement operative cells, but the challenge is to control
their full interconnection.


An alternative to these problems is to build a
calculator according to a suitable architecture, which must
provide a good compromise between a general-purpose
and a dedicated computer in the class of parallel
processors. That is to say, our aim is not to implement a
structure of neural networks directly in VLSI, but to create
an efficient arithmetic configuration for simulation of
"neuromimetic" networks. We propose an array
processor architecture with a very simple interconnection
network between the processing slices (a processing element
with associated memories). In fact, this interconnection
network must be elementary, because for neural network
analysis both scalar and vector processing abilities are
required together. These algorithms compute upon
well-structured data flows in relation with a large amount of data
stored in all the processing slices.

To be efficient, both for scalar and vector processing,
the arithmetic structure must be reconfigurable. So we
chose the simplest arithmetic array: a one-way linear
array which is efficient for matrix and vector
multiplications (basic operations in neural networks). For vector
processing, the arithmetic configuration is a pipeline
chain of processing slices. For scalar computations, we
only break the linear array. So each processing element
in a parallel and autonomous way computes the complete
scalar equations in relation with its memory.

Finally, in this article we describe a calculator named
"CRASY," which has a reconfigurable architecture. The
processor CRASY, as a prototype, is composed of only
two slices, each able to perform 20 Mflops. The
performance of the calculator is directly proportional to
the number of processing slices. This modular
architecture is very useful for extending the processor: it is
easy to build an N-slice processor able to perform 20 x N
Mflops. We plan to use CRASY's computational power
in the simulation of learning neural network models,
which constitute a new and efficient way of adaptive
information processing.

"Parallel Multigrid Solver for 3-D Anisotropic
Elliptic Problems"

Ute Gartel, Center for Computer Science, West Germany.

The efficiency of multigrids in solving elliptic partial
differential equations depends essentially on the ability of
the "smoothing operator" to reduce high-frequency error
components. Whereas for isotropic 3-D problems,
pointwise relaxation has reasonable smoothing properties, for
anisotropic cases line or even plane relaxation has to be
used.

A parallel (MIMD, local memory) multigrid
program for solving 3-D problems with arbitrary anisotropies
will be presented. Parallel line relaxation is based on a
reduction method; parallel plane relaxation is
implemented by using suitable 2-D multigrid methods.
Numerical results, especially concerning the performance,
will be presented and discussed.

Highest possible portability was one major goal in the
program design. This is achieved by means of a general
and flexible library of "communication routines" which do
both the mapping and the communication.
Machine-dependent language constructs are completely hidden
inside the library routines. This way, the program can be
used on different parallel and even sequential machines
by simply adapting the communication routines.

"The  Use  of  Systolic  Arrays  for  Finite  Element 
Calculations" 

Dr. Linda J. Hayes, University of Texas at Austin.

Systolic arrays are a network of very simple
processors which operate in parallel and are usually designed to
be special-purpose systems. One characterization of
systolic arrays is their asynchronous operation. The results
are passed between processors as data tokens and each
input travels through an array of cells before a final result
is returned to memory.

A systolic array design will be presented for doing
finite element calculations. In this array each processor is
extremely simple and there is one processor allocated for
each node in the finite grid. A single systolic array design
will be used not only to generate the finite element
equations but to maintain them at either an element or a
global level and also to solve the resulting linear systems of
equations. Each processor maintains one row of the
coefficient matrix either in element or global form.

Connectivity and data flow between processors are
dictated by the connectivity of nodes in the finite element
grid. The Hypercube was used to simulate the systolic
array design, and results are presented for several test
cases.

"Vectorization  of  Arnoldi-Tchebychev  Method  for 
Nonsymmetric  Matrices" 

F. Chatelin et al., IBM Scientific Center, France.

The vectorization of the Arnoldi-Tchebychev
procedure for solving nonsymmetric eigenvalue problems is
discussed. The procedure is based on the iterative
Arnoldi method in conjunction with the Tchebychev
acceleration technique. New criteria are established to identify
the optimal Tchebychev ellipse of the eigenspectrum.

A simple method has been developed to determine
the parameters of the optimal ellipse passing through two
eigenvalues in a complex plane relative to a reference
complex eigenvalue. The algorithm is fast, reliable, and
does not require a search for all possible ellipses which
enclose the spectrum. The procedure is applicable to
nonsymmetric linear systems as well.

"Applications  of  Computational  Fluid  Dynamics  for 
External  Flows  Relevant  to  Offshore  Engineering 
Employing  Supercomputers" 

A/ liercm'ier et al., The Hebrew University, Israel.

The application of computational fluid dynamics has
been shown to enhance the design process for offshore
structures. Due to the geometrical complexity of such
structures, the numerical models are frequently
comprehensive three-dimensional models which entail the use of
supercomputer capacity in order to ensure solutions
within the tight project schedules required by the offshore
industry.

The  present  paper  deals  with  the  numerical  solutions 
to  the  incompressible  Navier  Stokes  equations  for  exter¬ 
nal,  wind-induced  flows.  These  flows  are  of  significance 
in  terms  of  environmental,  safety,  and  loading  aspects  for 
offshore  structures. 

The program used in this study solves the Navier
Stokes and continuity equations using the finite element
method (FEM). The FEM has certain advantages in the
generation of the mesh for the complex geometries and
boundary conditions necessary for the above
applications.

The  flow  fields  obtained  using  the  methods  are  dis¬ 
played  together  with  a  brief  discussion  on  the  relevant  tur¬ 
bulence  models.  The  development  time  for  obtaining  the 
results  both  in  terms  of  manhours  and  computational  ef¬ 
fort  is  also  discussed  and  compared  with  alternative  ex¬ 
perimental  methods. 

"Aspects  of  Sparse  Matrix  Technique  on  a  Vector 
Computer" 

Niels Houbak, Lab. for Energiteknik, Denmark.

Sparse matrix technique is most profitable when
rows contain only a few nonzeros (i.e., short vector lengths),
whereas the advantages of vector computers are most
evident for long vectors. This, as well as the normally
nonvectorizable overhead in sparse codes, gives rise to
the impression that sparse codes vectorize poorly.

In many applications, though (e.g., solving stiff
systems of ODEs), it is often the case that many (almost)
identical matrices have to be factorized in the same run.
Exploiting the fact that the structure of the LU factors
needs to be computed only once (or only a few times), one
can dramatically reduce the overhead and increase the
vectorizability of all the factorizations but the first. By
increasing the storage requirements, one may even reduce
the overhead further for the third and the following
factorizations.

The various aspects hereof will be illustrated by runs
made on a CRAY X-MP and on an Amdahl VP1100.

"A  Dynamic  Load  Balancing  Scheme  to  Utilize  the 
Parallelism  in  a  ’FE’  Structural  Analysis  Program" 

Anders Hvidsten, University of Bergen, Norway.

The structural analysis program SESAM, which supports
a multilevel substructuring technique, is being
parallelized. The parallel version of this program is designed
to run on different computer architectures, including
both a distributed net of computers and shared memory
multiprocessors.

The approach taken is to view all user-defined
substructures in a structural model as subtasks. The actual
subtasks to be performed are factorization and
computation of Schur complements when traversing in the
hierarchy towards the top, and back substitution when
traversing downward in the hierarchy after equation
solving. These subtasks are organized in a shared pool of
work. Dynamic scheduling of subtasks in distributed and
nonuniform sets of computers is investigated, and
different objective functions are proposed for achieving good
load balancing.

The pool of work is implemented in terms of a shared
mailbox. This shared mailbox is the only way processes
are allowed to communicate with each other.
Functionality and performance of various communication utility
packages used to implement interprocess mailboxes are
analyzed.
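
The shared pool of work can be pictured with the following Python sketch, in which worker threads pull subtasks from a single queue until it is empty. This is an illustration under stated assumptions and not the SESAM implementation: the subtasks here are independent, so the ordering constraints of the substructure hierarchy are ignored, and a summation stands in for the factorization work. The names `run_pool`, `sub_a`, `sub_b`, and `sub_c` are hypothetical.

```python
import queue
import threading


def run_pool(tasks, num_workers):
    """Process (name, cost) subtasks from one shared pool with several workers."""
    pool = queue.Queue()  # plays the role of the shared mailbox
    for task in tasks:
        pool.put(task)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name, cost = pool.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            value = sum(range(cost))  # stand-in for factorizing a substructure
            with lock:
                results[name] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical subtasks of very different sizes.
results = run_pool([("sub_a", 10), ("sub_b", 1000), ("sub_c", 100)], num_workers=2)
```

Because each worker asks for new work only when it is free, a fast worker naturally takes more subtasks than a slow one, which is the basic load-balancing effect of a shared pool.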

The analysis is supported with examples from two
parallel environments: (1) a distributed net including
Sun-3s and an Alliant FX/8 connected via an Ethernet, and
(2) a Cray X-MP with two processors.

"Improvements to the Black-Oil Simulator (Eclipse 100)"

Oddvar Gjerde, IBM Oslo, Norway.

Parallelization  of  a  black  oil  simulator  based  on  iso¬ 
lated  geologic  structures  in  the  linear  solver  has  been  pro¬ 
posed  by  Kaarstad  et  al. 

Firstly, we have succeeded in making the parallelized
sections take into account the number of
phases present in each reservoir. In the previous
implementation, if one reservoir had three phases present,
then all reservoirs were treated as three-phase. In
the new implementation, any reservoir that has only
two phases present is treated as such. This will
reduce the number of equations
to be solved for the two-phase case by a factor of
four-ninths.

The latest implementation has taken the above-mentioned
proposals a step further. To keep the change simple
and add as little code as possible, we have
renumbered not only the cells in each reservoir, as
was the case in the above-mentioned report, but also
the planes, so that each reservoir has its own plane
count, similar to the pointers for the cells. This
means that by pointing to the first and last planes in
each reservoir, much of the code remains as before except
for the length of the vectors. Thus it is not necessary
to change existing pointers from vectors to two-dimensional
arrays.
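
The first-plane and last-plane pointer idea can be sketched as follows. This is only an illustration under assumed data: the plane counts, the flat vector of per-plane values, and the helper names `plane_pointers` and `sum_per_reservoir` are all hypothetical and do not come from the simulator itself.

```python
def plane_pointers(plane_counts):
    """Return (first, last) inclusive plane indices for each reservoir.

    All planes are assumed to be stored contiguously in one flat vector,
    reservoir after reservoir.
    """
    pointers, start = [], 0
    for count in plane_counts:
        pointers.append((start, start + count - 1))
        start += count
    return pointers


def sum_per_reservoir(values, pointers):
    """Run the same vector operation on each reservoir's own slice of planes."""
    return [sum(values[first:last + 1]) for first, last in pointers]


pointers = plane_pointers([3, 2, 4])           # three reservoirs (assumed sizes)
print(pointers)                                # [(0, 2), (3, 4), (5, 8)]
print(sum_per_reservoir(list(range(9)), pointers))
```

The loop body is unchanged from reservoir to reservoir; only the vector length differs, which is why one-dimensional vectors with pointers suffice.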

Furthermore,  we  have  extended  this  method  to  the 
routines  which  construct  the  linear  equations  using  the 
same  approach. 

Lastly,  we  are  also  adding  new  parameters  to  define 
reservoir  boundaries  in  a  more  general  way,  so  that  they 
do  not  need  to  consist  of  a  single  box  delineated  by  verti¬ 
cal  and  horizontal  planes. 


"Parallel  Implementation  Techniques  for  Prolog  on 
the  DAP" 

Kacsuk, Computer Research and Innovation Center, Hungary.

The main results of research on the parallel implementation
of Prolog on the distributed array processor
(DAP) are described. Though many projects are under
way to implement Prolog on MIMD computers, there
have so far been no proposals for implementing Prolog
on SIMD machines.

The underlying project proved four different approaches
to be viable for implementing Prolog on the
DAP.

1.  Basically sequential implementation mode, where
only certain parts of the Prolog interpreter work in parallel.
An example of the parallel subactivities is the undo
mechanism during backtracking.

2.  Applying a set-oriented interpretation mechanism,
where a mixed depth-first/breadth-first search strategy
is adopted. In this strategy the multiple-fact branches
of a conventional Prolog search-tree are considered as
generating binding sets rather than search non-determinism.

3.  SIMD machines such as the DAP are efficient at
data-parallel rather than task-parallel problems, enabling
them to work efficiently with large, homogeneous
data structures. Arrays are the most obvious software
realization of the SIMD aggregate of processing elements.
Extending Prolog with arrays enables Prolog to
be used efficiently in applications with a large numerical
part.

4.  A cellular-dataflow model for executing logic programs
was successfully implemented on the DAP. The
close relationship between cellular automata and the DAP
made it possible to implement the model in a straightforward
and elegant way.

The implementation of a Prolog variant for the DAP,
called DAP Prolog, is based on the above techniques.
DAP Prolog is an extension of ordinary Prolog with homogeneous
data structures.

DAP Prolog = Prolog + Sets + Arrays

The paper summarizes the main features of DAP
Prolog and gives a detailed description of the parallel implementation
techniques mentioned above.

"Debugging  Support  for  Parallel  Programs" 

David W. Krumme et al., Tufts University, Medford, Massachusetts.

The first question in debugging is "What is my program
doing?" All that is needed for a large portion of the
debugging process is to answer basic questions regarding
the flow of execution and the values of variables. On a
multiprocessor with hundreds or thousands of computational
nodes, extracting and utilizing this information
poses special problems. Indeed, the difficulties of debugging
programs on large multiprocessors are a major
impediment to the exploitation of the computational
power that these machines offer.

This paper describes some specific debugging support
tools for application programs developed as a part
of a general research effort in parallel program environments
at Tufts University. These tools are implemented
on a 64-processor NCUBE hypercube.

In  contrast  to  interactive  debuggers  which  allow  the 
programer  to  probe  for  specific  state  information,  we  are 
concerned  here  with  tools  that  provide  a  general  overall 
picture  of  an  execution,  without  relying  on  the  programer 
to  determine  what  to  probe  for.  We  perceive  a  gap  be¬ 
tween  the  basic  initial  condition  of  total  ignorance  on  the 
part  of  the  programer  and  the  situation  when  the  pro¬ 
gramer  knows  enough  about  what  is  happening  in  an  ex¬ 
ecution  to  probe  intelligently  for  particular  values.  We 
see  great  utility  in  tools  that  allow  a  rapid  progression 
from the former condition to the latter.

There are three problems to solve in developing a debugging
tool of this sort: deciding what basic facts should
be conveyed to the programer; providing instrumentation
to collect the relevant data; and creating a user interface
to present it efficiently. Our instrumentation is embedded
in a custom operating system called SIMPLEX
that we have designed and implemented on our machine;
it is capable of measuring any quantity of interest, and in
response to polling from a host monitor process it sends
out requested data. The user interface, called SEECUBE,
solves the problem of presenting large quantities
of information to the user through carefully designed
color graphics displays.

We  distinguish  between  execution  data  that  can  be 
extracted  automatically  for  an  arbitrary  program  and  that 
which  depends  on  advance  planning  by  the  programer. 
Experience  has  confirmed  our  belief  that  a  tool  that  does 
not  require  special  action  by  the  programer  has  import¬ 
ant  advantages  over  one  that  does.  For  example,  in  one 
case  a  programer  noticed  in  a  general  display  involving  a 
supposedly  fully  debugged  program  that  a  troublesome 
interaction between nodes was occurring that was never
suspected  of  being  possible,  and  hence  that  would  never 
have  been  probed  for.  When  coordination  among  pro¬ 
cessors  is  organized  around  message  transmissions  or 
other  coarse-grained  events  with  operating  system  invol¬ 
vement,  these  events  provide  the  basis  for  such  automati¬ 
cally  selected  execution  data. 

Message transmissions are easily measured and they
are  significant  to  the  programer  since  the  program  has 
been  necessarily  organized  around  them. 

Our general status display uses the following automatically
collected items:

•  The computational state of each processor, which is
color coded to represent one of: computing, waiting
to read a message, waiting to write a message, inactive,
and stopped due to fault

•  The number of available input messages at each processor,
represented as a single color-coded spot on the
display

•  The number of messages queued in the operating system
for output to a neighboring node, represented as
a color-coded spot

•  The traffic (number of messages transmitted between
pollings) over each communication channel, coded
into the color of the line segment used to represent
the channel.

This  information  produces  a  dense,  continuously  va¬ 
rying  display  on  a  13-inch  monitor  that  we  have  found 
gives  both  an  easily  comprehended  overall  status  display 
and  enough  detail  to  isolate  interesting  events.  For 
example,  the  overall  amount  and  balance  of  communica¬ 
tion  activity  is  perceived  through  the  apparent  "hotness" 
of the communication links, communications bottlenecks
show up as noticeable color spots representing
backed-up input or output messages, and overall load balancing
is apparent from the colors that show whether
nodes are computing or waiting.

Our first efforts with programer-planned data have
been as follows. The program may contain statements of
the form "state(x)" to set a processor's state variable to x,
which is simply an integer value between 0 and 255, with
the purpose of declaring dynamically that the program is
entering condition or state x. Although the programer is
free to invent arbitrary meanings for these values, we
generally expect them to encode the phases or steps of algorithms.

The operating system and the monitor process collect
these values along with the other data, and the display
presents them along with all the other information
described above. (The programer may define encodings
and color schemes to be used in displaying these values.)
For example, conjugate gradient algorithms consist of
iterations of three basic routines: matrix-vector multiplies,
inner products, and convergence checking (which
may be done frequently).

State variables could be used to let the programer
know which routine is being executed in each processor,
and to show the general timing relationship among the
routines.

A further use of state variables would be this: allow
the programer to define, at execution time, a target value
for the state variables, where execution on each node is
stopped when the node sets its state variable to the target
value. Or the programer might specify that execution on
all nodes is to be stopped when any one of some chosen
set of nodes reaches the target value.

By applying a debugger to the stopped program, the
programer might interrogate closely for particular values,
and then continue execution, perhaps with different targets.
This resembles the use of breakpoints in a sequential
program, except that it does not rely on the explicit
selection of locations in the programs, which may vary due
to the execution of different blocks of code on different
processors.

In the above example, execution could be halted on
each processor after the matrix-vector multiply was completed
so that the programer could check the results by
probing with a debugger, or execution could be halted the
first time any node reached the convergence checking to
see how far along the other nodes were.
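
The state-variable and target-value ideas can be mimicked in a small simulation. The sketch below is not SIMPLEX or SEECUBE code; the class `Node`, the functions `run` and `poll`, and the three phase codes are hypothetical names chosen for illustration, and the "nodes" are ordinary objects stepped in a loop rather than real processors.

```python
from dataclasses import dataclass

# Hypothetical phase codes for a conjugate gradient iteration (any 0..255 value is allowed).
MATVEC, INNER_PRODUCT, CONVERGENCE_CHECK = 1, 2, 3


@dataclass
class Node:
    """One simulated processor with a single state variable."""
    ident: int
    state: int = 0
    stopped: bool = False

    def set_state(self, x, target=None):
        """Counterpart of state(x): declare that phase x is being entered."""
        if not 0 <= x <= 255:
            raise ValueError("state must be an integer between 0 and 255")
        self.state = x
        if target is not None and x == target:
            self.stopped = True  # halt this node when it reaches the target value


def run(nodes, schedules, target=None):
    """Step every node through its own list of phases until it finishes or stops."""
    for node, phases in zip(nodes, schedules):
        for x in phases:
            node.set_state(x, target)
            if node.stopped:
                break


def poll(nodes):
    """What a host monitor would collect: the current state of each node."""
    return {node.ident: node.state for node in nodes}


nodes = [Node(0), Node(1)]
schedules = [
    [MATVEC, INNER_PRODUCT, CONVERGENCE_CHECK],
    [MATVEC, INNER_PRODUCT],  # this node never reaches the convergence check
]
run(nodes, schedules, target=CONVERGENCE_CHECK)
print(poll(nodes))  # {0: 3, 1: 2}
```

Note that the stopping condition refers only to the declared state value, not to a code location, so it applies even when different nodes execute different blocks of code.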

These debugging tools, except for the target value
idea, have been implemented on our hypercube multiprocessor
and we are currently experimenting with them.
The software, including supporting software, should be
ready for distribution to other sites in early 1988. We
would like to obtain feedback from applications programers
so that we might evaluate and refine the tools.

“Parallel  Algorithms  for  Solving  the  Triangular 
Sylvester  Equation  on  a  Hypercube  Multiprocessor” 

Bo Kagstrom et al., Institute of Information Processing, Sweden.

There are several problems one has to attack when
designing algorithms for a multiprocessor architecture.
Among these are the choice of granularity of parallelism
and the scheduling method of computations. That is, one
has to decide the grain size of the computation to be carried
out on each processor and their data dependencies,
how these grains of computation (processes) are to be
identified, and how the processes are to be scheduled
among the processors. In this talk we consider hypercube
architectures with a host, local memory, and message passing,
and no shared memory.

Message-passing parallel algorithms are developed
for solving the triangular Sylvester equation
AX + XB = C, where the unknown X is m x n and A, B,
C are given m x m, n x n, and m x n matrices, respectively,
with real entries and A, B are upper triangular. The
problem of transforming A and B to upper triangular form
will not be emphasized in this talk.

We discuss the inherent parallelism of the triangular
Sylvester equation and introduce the concept of data dependency.
In a "coarse grained" view the computation of
X can be done in three different ways: X can be computed
column-wise, row-wise, or block-wise, leading to three
different algorithms.
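
To make the column-wise variant concrete, here is a minimal sequential sketch in plain Python. It is not the authors' C code; the function name is hypothetical, and it assumes that A[i][i] + B[k][k] is nonzero for every i and k so that each entry of X is determined. Column k of X depends only on columns to its left, and within a column each entry depends only on the entries below it.

```python
def solve_triangular_sylvester(A, B, C):
    """Solve A X + X B = C for X, with A (m x m) and B (n x n) upper triangular.

    Entry (i, k) of the equation reads
        sum_{l >= i} A[i][l] X[l][k] + sum_{j <= k} X[i][j] B[j][k] = C[i][k],
    so X[i][k] follows once column k below row i and row i left of column k
    are known.
    """
    m, n = len(A), len(B)
    X = [[0.0] * n for _ in range(m)]
    for k in range(n):                    # columns, left to right
        for i in range(m - 1, -1, -1):    # back substitution, bottom row upward
            rhs = C[i][k]
            rhs -= sum(A[i][l] * X[l][k] for l in range(i + 1, m))
            rhs -= sum(X[i][j] * B[j][k] for j in range(k))
            X[i][k] = rhs / (A[i][i] + B[k][k])
    return X


A = [[2.0, 1.0], [0.0, 3.0]]
B = [[1.0, 4.0], [0.0, 5.0]]
C = [[6.0, 22.0], [12.0, 44.0]]
print(solve_triangular_sylvester(A, B, C))  # [[1.0, 2.0], [3.0, 4.0]]
```

The dependency pattern visible in the two inner sums is what a parallel version has to respect when columns, rows, or blocks are distributed over processors.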

The sequential algorithms are used in some modified
forms to build parallel algorithms. In order to make the
algorithms as efficient as possible for different values of
m and n and the number of processors, p, we study different
partitionings and mappings. The matrices A, B and C
are partitioned by columns, rows, and blocks. The mapping
methods used are block and wrap. A theoretical
analysis of the parallel algorithms in terms of arithmetic
and communication costs is presented.

We provide details in C programs that implement the
algorithms on a hypercube simulator and which should
run with little modification on real hypercube architectures
(e.g., the Intel iPSC). The performance of the various
algorithms for different values of m, n and the number
of processors p is demonstrated in terms of parallel speedup
and efficiency, processor utilization, and communications
costs. Results from real hypercube implementations
will also be reported (ongoing work).

“Parallel  Transonic  Flow  Calculations" 

Choi-Hong  Lai,  Queen  Mary  College,  UK. 

Transonic  flow  calculations  using  a  domain  decom¬ 
position  technique  are  discussed  and  their  performance 
on  an  array  processor  is  investigated.  The  emphasis  of 
this paper is on the mapping strategy, namely "sliced mapping,"
for transonic flow calculations. Advantages of
using the sliced mapping for this application are discussed
as compared with crinkled mapping.

First, the transonic small perturbation (TSP) equation is investigated.
Particular attention is paid to the various iteration
methods within the mixed subsonic and supersonic
region. A modified Gauss-Seidel iteration is applied to
the above region, which shows a good improvement over
a parallel AF2 technique. Another technique that has
been used is a linearised discrete operator of the
TSP equation, which shows its advantages as compared
with the use of the nonlinear discrete operator for low
Mach numbers such as 0.7. Second, the technique is generalized
to transonic Euler equations.

"A  Reconfigurable  Multitransputer  Network  as  a 
Tool for the Experimentation of Parallelism in Scientific
Computing”

Pierre  Leca  and  Alain  Cosnuau,  ONERA,  France. 

This paper presents the studies done at ONERA
using the transputer as a building block for experimentation
in the field of scientific multiprocessing.

The architecture of a reconfigurable multitransputer
network, built with standard INMOS modules, is
first described and the main hardware features of the
building blocks (T800, C004) are given:

Sixteen T800 Transputers, accessing 32 KB of local
memory, are connected together in a double ring configuration
using two hardware bidirectional links per transputer.
The two other bidirectional links are connected to
the IMS C004 32-way crossbar switch. The control of this
switch is programed via a hardware link of a dedicated
T212 Transputer, in such a way that the topology of the
16-transputer network is dynamically reconfigurable by
software.

Due  to  the  C004  switch,  this  system  may  be,  for  in¬ 
stance,  configured  as  a  linear  array,  a  double  ring,  a  two- 
dimensional  torus,  or  as  more  complex  networks  if  each 
bidirectional  link  is  split  into  two  one-directional  links, 
and  may  be  programed  horizontally  or  vertically  using  the 
OCCAM language, whose characteristics are briefly described.

A description is given of the different types of algorithms
that have been implemented on this multiprocessor, with
their OCCAM translation and their relative performances.
These are:

(i)  Systolic algorithms: 1-D or 2-D systolic algorithms,
such as convolution, matrix-by-vector, or matrix-by-matrix
products. Moreover, the dynamic switch allows
the chaining of two distinct systolic algorithms.

(ii)  Coordinated computations: this system may be
seen as a set of processors that coordinate their activities
by exchanging data via the hardware links. Linear recurrences
that appear in "Gauss-Seidel"-like algorithms
match perfectly with this architecture.

In the future this multitransputer architecture will
be seen as the target of a software tool that automatically
detects potential parallelism between FORTRAN
instructions at the loop level. We will give some
indications about the possibility of automatically generating
parallel code and a network configuration for this
architecture.

"Nested  Dissection  Orderings  for  Parallel  Sparse 
Cholesky  Factorization" 

John Lewis, Boeing Computer Services Co., US.

The time required to factor a sparse symmetric matrix
on a parallel computer is strongly affected by
symmetric reorderings of the matrix. The data dependencies
that restrict parallelism in the factorization are
given by the elimination tree induced by the ordering.
Thus, one standard measure of (and a lower bound
on) parallel completion time is the depth of the elimination
tree. In previous work [1] we have found nested
dissection orderings for general sparse matrices that appear
to be better orderings for parallel factorization than
the best parallel versions of the best standard sequential
orderings. The nested dissection orderings have less
deep elimination trees than the parallelized sequential
orderings, and their sequential measures of complexity
are essentially equivalent to those of the sequential orderings.
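
The effect of an ordering on elimination tree depth can be seen on a tiny example. The sketch below is an illustration, not the author's code: it builds the elimination tree with the well-known ancestor-compression algorithm (an assumption on my part, since the abstract does not say how the trees are computed) and compares the natural ordering of a 7-vertex path graph (a tridiagonal matrix) with a hand-written nested dissection ordering. All function names are hypothetical.

```python
def upper_pattern(n, edges, order):
    """For each new label k, list the smaller labels i with a nonzero A[i][k]."""
    label = {vertex: k for k, vertex in enumerate(order)}
    upper = [[] for _ in range(n)]
    for u, v in edges:
        small, large = sorted((label[u], label[v]))
        upper[large].append(small)
    return upper


def elimination_tree(upper):
    """Return parent[] of the elimination tree (-1 marks a root)."""
    n = len(upper)
    parent = [-1] * n
    ancestor = [-1] * n          # compressed paths toward the current root
    for k in range(n):
        for i in upper[k]:
            while i != -1 and i < k:
                nxt = ancestor[i]
                ancestor[i] = k
                if nxt == -1:
                    parent[i] = k
                i = nxt
    return parent


def tree_height(parent):
    """Number of nodes on the longest root-to-leaf path (parents have larger labels)."""
    depth = [0] * len(parent)
    for i in range(len(parent) - 1, -1, -1):
        depth[i] = 1 if parent[i] == -1 else depth[parent[i]] + 1
    return max(depth)


edges = [(i, i + 1) for i in range(6)]       # path graph on vertices 0..6
natural = list(range(7))
dissection = [0, 2, 1, 4, 6, 5, 3]           # middle vertex 3 eliminated last

for order in (natural, dissection):
    parent = elimination_tree(upper_pattern(7, edges, order))
    print(order, "height", tree_height(parent))
```

The natural ordering gives a chain of height 7, while the dissection ordering gives a balanced tree of height 3, so independent subtrees can be factored concurrently.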

The orderings in [1] are based on the Fiduccia-Mattheyses
graph partitioning heuristic. We have observed
that this heuristic is quite sensitive to a number of implementation
parameters, making it difficult to create a
generally effective ordering algorithm. In addition, the
graph partitioning heuristic is sequential and is quite expensive.
In this work we explore several alternative approaches
to finding nested dissection orderings of similar
quality. Our goal is a cheaper and more robust ordering
method.

Our  approaches  include: 

•  Parallel implementation of variants of the Kernighan-Lin
and Fiduccia-Mattheyses graph partitioning algorithms

•  Graph partitioning by simulated annealing

•  Variants  of  the  George  &  Liu  automatic  nested  dissec¬ 
tion  procedure. 

The performance of these ordering heuristics is compared
on a set of sparse matrices obtained from actual applications.


"Applying  a  Sequence  of  Plane  Rotations  on  a  Vec¬ 
tor-processing  Machine" 

Jeremy Du Croz et al., Numerical Algorithms Group Ltd., UK.

When eigenvalues and eigenvectors, or singular
values and singular vectors, are computed by means of the
QR algorithm, a substantial amount of time is spent in applying
sequences of plane rotations to the rows or columns
of a matrix. Indeed, if we consider just the last step
in computing the singular value decomposition, namely
computing the singular values and vectors of a bidiagonal
matrix, the computing time is dominated by the time spent
in applying plane rotations to update the matrices of left
and right singular vectors. It is therefore desirable to perform the
plane rotations as efficiently as possible.

The Level 1 BLAS routine, SROT, can be faster than
in-line FORTRAN code, but, especially on a machine
with vector registers, it gives only a limited speedup because
it takes no advantage of the chaining between successive
rotations. Much better performance can in
principle be obtained by a routine of larger granularity
(similar to that of the Level 2 BLAS) which applies sequences
of plane rotations. Such a routine has been included
in Mark 13 of the NAG Library. It can apply
sequences of plane rotations to either the rows or columns
of a rectangular matrix in any of the orders

(1,2),(1,3),(1,4),...,(1,n)

(1,n),(2,n),(3,n),...,(n-1,n)

(1,2),(2,3),(3,4),...,(n-1,n)

(n-1,n),...,(3,4),(2,3),(1,2)
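
As an illustration of the third ordering, (1,2),(2,3),...,(n-1,n), the following sketch applies a whole sequence of rotations to the columns of a matrix in one call. It is plain Python, not the NAG routine; the function name and argument layout are hypothetical. Each rotation uses the SROT convention x' = c*x + s*y, y' = c*y - s*x, and indices are 0-based.

```python
import math


def apply_rotation_sequence(A, cs, sn):
    """Apply rotations in planes (0,1),(1,2),...,(n-2,n-1) to the columns of A, in place.

    cs[k] and sn[k] are the cosine and sine of the rotation in plane (k, k+1).
    Rotations acting on columns treat every row independently, so each row can
    be carried through the whole sequence before moving to the next row.
    """
    n = len(A[0])
    for row in A:
        for k in range(n - 1):
            c, s = cs[k], sn[k]
            x, y = row[k], row[k + 1]
            row[k] = c * x + s * y
            row[k + 1] = c * y - s * x
    return A


angles = [0.3, 1.1]
M = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]
apply_rotation_sequence(M, [math.cos(t) for t in angles], [math.sin(t) for t in angles])
print(M)
```

In this loop order the element written to position k + 1 by one rotation is immediately reused as input by the next, which is the kind of reuse between successive rotations that a one-rotation-per-call interface cannot exploit.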

In order to achieve optimal use of vector registers, it
is  necessary  to  code  the  routine  in  assembly  language.  We 
shall  demonstrate  the  speed  obtainable  with  such  an  as¬ 
sembly  language  version,  and  the  speedup  which  it  con¬ 
fers  on  NAG  Library  routines  for  computing  eigenvectors 
and  singular  vectors. 

We  shall  propose  a  specification  for  the  routine,  and 
recommend  that  efficient  implementations  be  provided 
by manufacturers in the same spirit as for the BLAS.

"Nonlinear  Transport  Calculations  in  1-D  MOSFETs 
Using  a  CRAY  X-MP/48  and  a  Sequent  Balance 
Multiprocessor" 

J.A. McInnes and S.A. Mugstad, University of Strathclyde, UK.

The electrical conductance in a disordered system
may be stated in terms of a set of nonlinear equations.
In macroscopic systems the nonlinear effects are
often negligible or self-averaging and the appropriate
equations are linearised. In the case of submicron 1-D MOSFETs,
nonlinear effects must be retained in the formalism.

Algorithms for solving the corresponding nonlinear
equations are developed on both a CRAY X-MP/48 and
a Sequent Balance multiprocessor. The computed magnetic
field dependence of electron transport in submicron
MOSFETs for varying applied gate voltage is compared
with experimental observations.


"Supercomputing  in  Denmark" 

Bjarne Stig Andersen et al., The Supercomputer Group, The Danish
Computing Centre for Research and Education, Denmark.

Three major university computing centres, RECKU
(Copenhagen), RECAU (Aarhus), and NEUCC (Lyngby),
have united to form one supercomputer centre for
Denmark, called in English the Danish Computing
Centre for Research and Education, abbreviated to
UNI•C.

A major event for the Danish scientific community
was the recent installation of the Amdahl VP
1100 at UNI•C. This supercomputer is intended for
joint technological research carried out by the Danish
universities, research institutes, and industrial
companies. Besides the Amdahl VP, UNI•C has an IBM
3081, a Sperry 1100/92, a CDC Cyber, an Alliant, and several
VAXes.

The Amdahl VP 1100 has one CPU with a peak performance
of 286 Mflop/s. The vector and scalar instructions
can run simultaneously. The 30 Gbyte disk storage
is  supplemented  with  IBM  3420  tape  drives.  This  is  in  ad¬ 
dition  to  the  large  core  of  128  Mbytes.  The  lessons  of  the 
first  year  will  be  described,  including  experience  with  the 
national  network  being  established  to  connect  all  univer¬ 
sity  users  in  Denmark  to  the  computer.  The  network  uses 
a  2-Mbit/s  trunk. 

The NAG library has been implemented and specialized,
and the basic subroutines (BLAS) have been
adapted for the Amdahl VP. Many user programs have
been converted, and these have demonstrated the excellent
performance of the vectorizing FORTRAN compiler.
Also, the very large memory, 128 Mbytes,
contributes to much shorter program elapsed times, e.g.,
by moving temporary data sets into main memory. Examples
will be given and explained.

"Cycle  Reduction  and  Matrices  with  a  Group 
Structure" 

Hans Munthe-Kaas, Norwegian Institute of Technology, Section for Numerical
Mathematics.

Cyclic reduction is a variant of Gaussian elimination
for solving special systems of linear equations. It
has been successfully applied in fast Poisson solvers,
in direct parallel methods for solving tridiagonal equations,
and as a parallel preconditioner for iterative methods.
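
For readers unfamiliar with the standard scheme, here is a minimal sequential sketch of cyclic reduction for a tridiagonal system. It shows only the textbook algorithm, not the group-theoretic generalizations of the talk. The function name and the argument convention (sub-diagonal `a`, diagonal `b`, super-diagonal `c`, right-hand side `d`, with `a[0] = c[-1] = 0`) are assumptions of this sketch, and no pivoting is done, so it presumes the pivots stay nonzero (for example, a diagonally dominant matrix).

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: use equations i-1 and i+1 to eliminate the even-indexed
    # unknowns from every odd-indexed equation i. The updates for different i
    # are independent of one another, which is the source of parallelism.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        has_next = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_next else 0.0
        a_next, c_next, d_next = (a[i + 1], c[i + 1], d[i + 1]) if has_next else (0.0, 0.0, 0.0)
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - gamma * a_next)
        rc.append(-gamma * c_next)
        rd.append(d[i] - alpha * d[i - 1] - gamma * d_next)

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)   # half-size tridiagonal system

    # Back substitution: each even-indexed unknown follows from its own equation.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


a = [0.0, -1.0, -1.0, -1.0, -1.0]
b = [2.0] * 5
c = [-1.0, -1.0, -1.0, -1.0, 0.0]
d = [0.0, 0.0, 0.0, 0.0, 6.0]
print(cyclic_reduction(a, b, c, d))
```

Each level halves the system, so only about log2(n) dependent stages remain, with every stage consisting of independent updates.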

In the talk we will show that cyclic reduction may be
described in the language of groups connected with the
matrix graph. This may be done in many different ways,
giving rise to different algorithms. We investigate generalizations
of the standard algorithms and show that the
generalized algorithms perform better than the standard
ones in solving simple tridiagonal systems on the Cray X-MP.
Furthermore, we will implement and compare different
parallel preconditioners for iterative methods, based on
cyclic reduction-like schemes.


“Evolution  Algorithms  in  Combinatorial  Optimiza¬ 
tion" 

H. Mühlenbein et al., Gesellschaft für Mathematik und Datenverarbeitung
mbH, West Germany.

Evolution algorithms for combinatorial optimization
were proposed in the 1970s. They did not have a major
influence. With the availability of parallel computers,
these algorithms will become more important.

In this paper we discuss the dynamics of three different
classes of evolution algorithms: network algorithms
derived  from  the  replicator  equation,  Darwinian  algo¬ 
rithms,  and  genetic  algorithms  inheriting  genetic  infor¬ 
mation. 

We present a new genetic algorithm which relies on
intelligent  evolution  of  individuals.  With  this  algorithm, 
we  have  computed  the  best  solution  of  the  famous  travel¬ 
ing  salesman  problem.  The  algorithm  is  inherently  par¬ 
allel  and  shows  a  superlinear  speedup  in  multiprocessor 
systems. 

"Data  Distribution  and  Communication  for  Parallel 
Analysis  of  3-D  Body-Scan  Data" 

M.G. Norman et al., University of Edinburgh, UK.

It is widely anticipated that the flexibility of MIMD
architectures  will  allow  efficient  parallel  implementation 
of  middle-  and  high-level  vision  applications.  In  truth,  the 
problems  of  achieving  high  levels  of  processor  utilization 
while  keeping  communications  overheads  low,  mean  that 
middle-  and  high-level  vision  applications  can  only  be  im¬ 
plemented  efficiently  on  MIMD  parallel  architectures 
with  great  difficulty. 

This  paper  describes  an  attempt  to  implement  low- 
and  middle-level  vision  algorithms  on  the  MIMD  archi¬ 
tecture  of  the  Meiko  Computing  Surface.  The  image 
being  processed  was  the  three-dimensional  data  pro¬ 
duced  by  medical  magnetic  resonance  imaging.  The  size 
of  the  dataset  meant  that  it  was  not  possible  to  store  it  on 
a  single  processor,  and  it  was  distributed  among  proces¬ 
sors  in  a  way  that  attempted  to  maximize  the  efficiency  of 
a middle-level algorithm (tracking of surfaces within
three-dimensional datasets) while maintaining the efficient
implementation of low-level operations.

This  paper  also  describes  the  details  of  the  finished 
system  including  the  necessary  communications  harness, 
a  generalization  of  the  Zucker-Hummel  surface  detec¬ 
tion  operator  to  non-square  metric,  and  three-dimen¬ 
sional  surface  display. 

“Diffusion  Limited  Aggregation  -  Model  and 
Methods" 

M.A. Novotny et al., Bergen Scientific Centre, IBM, Norway.

The Diffusion Limited Aggregation (DLA) model
introduced in 1981 has been used to model a wide
variety of physical phenomena, including viscous fingering
in Hele-Shaw cells, fluid-fluid displacement in
porous media, electric breakdown, electrodeposition, the
formation of snowflakes, and irreversible kinetic aggregation.

The underlying equation of this model is the Laplace
equation, $\nabla^2 \phi = 0$, subject to the boundary conditions that
$\phi = 1$ at infinity and $\phi = 0$ on the boundary of the aggregate.
The boundary moves as the aggregation process proceeds,
producing fractal aggregates. We will discuss
various modifications of the original DLA algorithm
which allow the inclusion of additional physical processes
such as surface tension. In addition, we will describe the
concept of noise reduction and show that, contrary to the
accepted current philosophy, noise reduction changes the
asymptotic shape of DLA aggregates.

We  have  implemented  this  algorithm  on  a  six-proces¬ 
sor  IBM  3090  using  Parallel  FORTRAN,  and  will  de¬ 
scribe  how  this  inherently  scalar  algorithm  may  be 
parallelized  and  vectorized. 

"One-Way  Dissection  with  Pivoting  on  the  Hyper¬ 
cube" 

i.-H. Oleseit and J.R. Gilbert, Chr. Michelsen Institute, Norway.

A one-way dissection ordering is used for Gaussian
elimination with partial pivoting to solve 2-D elliptic finite
difference and finite element problems on a hypercube
parallel processor. These problems can be put into
a matrix-vector form, Ax = f, where the matrix A takes
the place of the differential operator, x is the solution vector,
and f is the source vector. Using "normal width" separators
creates problems while triangularizing A in
parallel; an inordinate amount of communication is
needed, which destroys the benefits of using a parallel
computer.

One  solution  to  this  communication  problem,  which 
is  used  in  this  report,  is  to  use  double-width  separators. 
With  this  type  of  dissection,  the  domain  is  spread  across 
all  the  nodes  without  any  overlap.  Each  processor  is  given 
its  own  part  of  the  source  vector  and  computes  its  own 
part of the stiffness matrix, A.

The elimination starts out in parallel; communication
is only needed after most of the elimination
is  finished  when  the  separators  need  to  be  elimi¬ 
nated.  Back  substitution  is  initially  done  on  the  sep¬ 
arators,  and  then  totally  in  parallel  without 
communication  on  each  node.  This  method  extends 
the  size  of  the  linear  equations  that  can  be  solved  direct¬ 
ly  on  the  hypercube. 

"A  Quadratically  Convergent  Parallel  Eigen¬ 
value  Algorithm  Based  on  Jacobi-Like  Transfor¬ 
mations" 

M.H.C.  Paardekooper,  Tilburg  University,  the  Netherlands. 

This paper discusses a generalization of the Jacobi
process (1846) for arbitrary almost diagonal n x n
matrices, n even. In each step of the final stage, n/2
symmetrically placed pairs of nondiagonal elements
are annihilated. We prove that with the caterpillar
pivot strategy the sequence of transformed matrices
converges to a diagonal matrix. The method, which converges
quadratically even in the case of multiple eigenvalues, is
appropriate for parallel computation on an array processor.

Preceding the annihilation process with its local
information structure, a parallel norm-reducing Jacobi-like
process transforms the initial matrix into an almost
diagonal one, adapted to the final, quadratically convergent
annihilation method. In this pretreatment each step
is a direct sum of unimodular shears. As a consequence
of the invariance of the Frobenius matrix norm under orthogonal
transformation, the norm reduction should be
discussed in theoretical terms that are invariant under
these transformations. In our modification of Sameh's algorithm
(1971), the problem formulation with so-called
Euclidean parameters improves and simplifies the procedure
and the formulae in the related minimization problem.

Different aspects of the method and its successful implementation
(communication, storage strategy) on the
Delft Parallel Processor will be discussed.
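
As a point of reference for the annihilation idea, the following is a minimal sketch of the classical cyclic Jacobi iteration for a real symmetric matrix. It is not the author's generalized method for arbitrary matrices: the function name `jacobi_eigenvalues`, the serial row-by-row pivot ordering, and the fixed sweep count are assumptions made for illustration only. Each rotation annihilates one symmetric pair of off-diagonal elements; rotations on disjoint index pairs do not interfere, which is what makes simultaneous annihilation of n/2 pairs possible on an array processor.

```python
import math

def jacobi_eigenvalues(matrix, sweeps=10):
    """Classical cyclic Jacobi iteration (illustrative, symmetric case only)."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle chosen so that the rotated element a[p][q] becomes zero.
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.pi / 4.0
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                # A <- A J (update columns p and q).
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # A <- J^T A (update rows p and q).
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))
```

For the 3 x 3 matrix with rows (4, 1, 0), (1, 3, 1), (0, 1, 2), the diagonal converges to the eigenvalues 3 - sqrt(3), 3, and 3 + sqrt(3).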

"The  Timing  of  Sort  Algorithms  on  the  Amdahl  1200 
Vector  Processor" 

Uado Barady, Vector Processor Technical Services, California, USA, and
Kenichi Miura, Mainframe Division of Fujitsu Ltd., Japan.

Methods  for  sorting  integers  into  ascending  order  on 
the Amdahl 1200 are compared. Algorithms have been
modified  to  increase  the  performance  of  sorting  integer 
arrays  and  to  provide  the  sorting  of  arrays  using  a  pointer 
list.  A  variant  of  the  radix  sort,  in  the  limit  of  a  binary 
radix,  can  be  fully  vectorized  on  the  Amdahl  Vector  Pro¬ 
cessor;  it  sorts  12,800  elements  in  less  than  5  milliseconds. 
Its  performance  is  compared  to  widely  used  sorting  meth¬ 
ods. 
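
The binary-radix idea can be sketched as follows. This is an illustrative Python version, not the authors' Amdahl code: the names `binary_radix_sort` and `binary_radix_argsort` and the 16-bit key width are assumptions. Each pass splits the data on one bit into a "zeros" group and a "ones" group while preserving order; a mask followed by a compress is the kind of operation a vector processor can perform on whole arrays at once. The second function sorts a pointer (index) list instead of the data themselves.

```python
def binary_radix_sort(values, bits=16):
    """Least-significant-bit-first radix sort with radix 2.

    Assumes non-negative integers smaller than 2**bits.
    """
    data = list(values)
    for b in range(bits):
        mask = 1 << b
        # Stable partition on bit b: elements with a 0 bit keep their
        # relative order and precede those with a 1 bit.
        zeros = [v for v in data if not v & mask]
        ones = [v for v in data if v & mask]
        data = zeros + ones
    return data

def binary_radix_argsort(values, bits=16):
    """Same passes, but permuting a pointer list rather than the data."""
    order = list(range(len(values)))
    for b in range(bits):
        mask = 1 << b
        zeros = [i for i in order if not values[i] & mask]
        ones = [i for i in order if values[i] & mask]
        order = zeros + ones
    return order
```

The number of passes equals the key width in bits, independent of the data, so the cost per element is fixed and every pass has the same regular structure.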

"Problem  Parallelism  Versus  Processor  Parallel¬ 
ism" 

D. Parkinson, Queen Mary College, UK, and R. Francis, La Trobe
University,  Australia. 

The  literature  of  parallel  algorithms  is  full  of  papers 
which discuss the maximal parallelism in tasks. In practice,
real multiprocessor systems have fewer processors
than  demanded  by  the  algorithm. 

A typical example occurs in image processing algorithms
on n x n images with n = 512 or larger. Processor
arrays of this size only exist in imagination and so
the  practical  problem  is  how  to  properly  use  a  smaller 
array. 

The experience gained in implementing an image-component-labelling
algorithm for 512 x 512 images on
32 x 32 DAPs will be presented. Relative timings of a variety
of strategies will also be presented.

"FORTRAN as a Parallel Programing Language"

Ph.  D.  John  R.  Perry,  Alliant  Computer  Systems  Corp.,  US. 

This is an overview of an approach in computer and
compiler architecture that allows parallel programing
within the standard FORTRAN-77 language. This approach
provides the capability to exploit implicit fine-grained
parallelism of loop iterations automatically, as
well as coarse-grained parallelism at the subroutine or
subroutine  tree  level  under  the  more  explicit  control  of 
the  programer.  These  capabilities  are  implemented  in 
the  Alliant  FX/8  and  FX/4  mini-supercomputers  and  the 
FX/FORTRAN  compiler. 

In  this  paper,  we  discuss  solutions  for  overcoming  the 
various  technical  barriers  to  achieving  parallel  processing 
within  the  bounds  of  the  standard  FORTRAN  language. 
These  solutions  include  hardware  features  to  support  ef¬ 
ficient  management  of  parallel  computations,  and  com¬ 
piler  capabilities  in  such  areas  as  dependency  analysis, 
memory  management  for  parallel  computations,  inter¬ 
procedural  analysis,  and  idiom  recognition.  We  present 
various  examples  of  FORTRAN  scientific  programs  to  il¬ 
lustrate  each  of  these  capabilities,  and  performance  re¬ 
sults  to  show  the  parallel  speedup  of  various  classes  of 
problems. 
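
The distinction a compiler's dependency analysis has to draw can be illustrated with a small sketch. The code below is Python rather than FORTRAN and is not taken from the FX/FORTRAN compiler; the function names are hypothetical. The first loop has independent iterations and can run in any order; the second carries a dependence from one iteration to the next; the third shows the same recurrence expressed as a recognized idiom (a running sum), which a parallel prefix operation could then evaluate.

```python
from itertools import accumulate

def independent_loop(b, c):
    # a[i] depends only on b[i] and c[i]: no iteration needs another's result.
    return [bi + ci for bi, ci in zip(b, c)]

def carried_loop(b):
    # a[i] = a[i-1] + b[i]: every iteration needs the previous result.
    a = []
    previous = 0
    for bi in b:
        previous = previous + bi
        a.append(previous)
    return a

def recognized_idiom(b):
    # The same recurrence, identified as a running sum.
    return list(accumulate(b))
```

Both `carried_loop` and `recognized_idiom` compute the same values; only the second form exposes the structure that allows a non-serial evaluation.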

Finally, we discuss ongoing efforts to evolve FORTRAN
into a parallel processing language. In particular,
we show how features of the proposed FORTRAN-8X
standard  contribute  to  the  ability  to  exploit  implicit  par¬ 
allelism,  and  we  survey  the  efforts  underway  to  define  a 
standard  for  extending  FORTRAN  with  explicit  parallel 
processing  constructs. 

"Optimal  Absorbing  Boundary  Operators" 

B.  Rostand  and  J.  Petersen,  Bergen  Scientific  Centre  IBM,  Norway. 

When  computing  solutions  to  the  two-dimen¬ 
sional  wave  equation  in  unbounded  domains  using  fi¬ 
nite  difference  discretization,  an  artificial  boundary  is 
introduced.  A  boundary  condition  which  absorbs  all  out¬ 
ward  propagating  waves  must  then  be  used.  Equivalent¬ 
ly,  the  finite  difference  operator  must  be  replaced  at  the 
boundary  with  an  appropriate  boundary  operator. 

Approximations  to  boundary  operators  have  been 
published.  These  work  well  for  waves  which  are  propa¬ 
gating  towards  the  boundary  near  normal  incidence. 
However,  problems  occur  in  cases  where  sources  are  far 
from  the  center  of  the  model  or  if  the  velocity  field  is  not 
homogeneous.  In  such  cases  one  tends  to  get  results 
which are contaminated with noise scattered back from
the  artificial  boundaries. 

We  propose  a  nonlinear  least  squares  method  for 
determining  an  absorbing  boundary  operator.  The  oper¬ 
ator  is  chosen  by  demanding  that  waves  traveling  within 
a predetermined cone are attenuated as much as possible.
We  solve  the  problem  with  a  Monte  Carlo-type  minimi¬ 
zation  method. 

We  show  how  the  improvements  in  the  results  can  be 
used  to  reduce  the  storage  requirements  by  reducing  the 
grid  size  and  will  show  results  from  the  vectorization  of 
the  method. 
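
The idea of fitting a boundary operator to a cone of incidence angles can be sketched on a much simpler model than the one in the paper. The example below assumes a one-parameter, first-order boundary operator, cos(alpha) d/dt + c d/dx, for which a plane wave arriving at angle theta from the normal is reflected with amplitude |cos(alpha) - cos(theta)| / (cos(alpha) + cos(theta)). The operator family, the function names, and the plain random search are all assumptions for illustration; the authors' operator and their Monte Carlo-type minimizer are not described in the abstract.

```python
import math
import random

def reflection(alpha, theta):
    """Reflection amplitude of the assumed first-order operator."""
    return abs(math.cos(alpha) - math.cos(theta)) / (math.cos(alpha) + math.cos(theta))

def cone_error(alpha, half_angle, samples=181):
    """Mean squared reflection over incidence angles in [0, half_angle]."""
    thetas = [half_angle * k / (samples - 1) for k in range(samples)]
    return sum(reflection(alpha, t) ** 2 for t in thetas) / samples

def random_search(half_angle, trials=500, seed=0):
    """Random search for the parameter with the least mean squared reflection."""
    rng = random.Random(seed)
    best_alpha = 0.0  # alpha = 0 absorbs perfectly at normal incidence only
    best_error = cone_error(best_alpha, half_angle)
    for _ in range(trials):
        alpha = rng.uniform(0.0, half_angle)
        error = cone_error(alpha, half_angle)
        if error < best_error:
            best_alpha, best_error = alpha, error
    return best_alpha, best_error
```

Calling `random_search(math.radians(60.0))` returns a parameter whose mean squared reflection over the 60-degree cone is lower than that of the normal-incidence choice alpha = 0, at the price of a small reflection at normal incidence.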


"A  Divide  and  Conquer  Method  for  the  Orthogonal 
Eigenproblem" 

W.B. Gragg, Naval Postgraduate School, California, and L. Reichel, Bergen
Scientific Centre IBM, Bergen, Norway.

In divide and conquer methods for eigenproblems,
the original eigenproblem is split into several smaller subproblems
by using rank-one modifications. The subproblems
are then solved independently, and from their
solutions the solution of the original eigenproblem can be
determined. This approach has been developed for symmetric
matrices by Cuppen, Dongarra, and Sorensen
and implemented on a parallel computer by Dongarra
and Sorensen.

Their  implementation  demonstrates  that  divide  and 
conquer  methods  are  very  suitable  for  a  parallel  com¬ 
puter.  We  will  describe  a  divide  and  conquer  method  for 
the computation of eigenvalues and eigenvectors of orthogonal
matrices. Among the applications of the orthogonal
eigenproblem are the computation of Pisarenko
frequency estimates in signal processing, and the generation
of Gaussian quadrature rules for the unit circle.

"Continuation  of  Parameter-Dependent  Partial  Dif¬ 
ferential  Systems  on  a  Hypercube" 

D. Roose et al., Belgium.

We  present  a  parallel  algorithm  for  the  continuation 
of  finite-difference  discretizations  of  elliptic  differential 
systems  on  a  hypercube  concurrent  processor. 

Continuation  procedures  are  used  for  the  analysis  of 
a nonlinear problem depending on a parameter - i.e., for
the computation of a branch of solutions as a function of
the  parameter.  In  general,  each  continuation  step  con¬ 
sists  of  a  predictor  and  a  corrector  part.  The  parallel  al¬ 
gorithm  is  based  on  the  continuation  code  PITCON  of 
Rheinboldt  and  Burkardt.  In  this  case  a  predictor  is 
chosen  along  the  tangent  to  the  branch  and  a  Chord-New¬ 
ton  iteration  is  used  for  the  corrector  part. 

Most of the computing time during both the predictor
and the corrector steps is spent in the solution of linear
systems with a matrix representing the linearization
of the original problem. For a finite-difference discretization
of a two-point boundary value problem, this matrix
is a band matrix bordered with an additional row and column.

An  efficient  solution  of  the  linear  systems  is  obtained 
by  combining  the  block-elimination  algorithm  of  Keller 
and the parallel band solver proposed by Saad and
Schultz. For discretizations of a two-dimensional elliptic
problem, systems have to be solved with a bordered block
tridiagonal matrix. In this case an algorithm related to
domain  decomposition  is  used.  The  distribution  of  the 
data  among  the  nodes  is  imposed  by  the  solution  proce¬ 
dures  for  the  linear  systems.  We  show  how  the  remain¬ 
ing  parts  of  the  continuation  algorithm  can  be  organized 
efficiently,  taking  into  account  the  particular  distribu¬ 
tions.  Timing  results  for  the  iPSC  hypercube  will  be 
presented. 
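
The block-elimination step for a bordered system can be sketched as follows. The sketch assumes the simplest case, a tridiagonal matrix A bordered by one column b and one row c, and solves the two band systems with a serial Thomas algorithm; in the paper these band solves are performed by a parallel band solver on the hypercube. The function names and argument layout are assumptions for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are not used in the result."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def solve_bordered(lower, diag, upper, b, c, d, f, g):
    """Solve [[A, b], [c^T, d]] [x; y] = [f; g] by block elimination."""
    v = solve_tridiagonal(lower, diag, upper, b)  # A v = b
    w = solve_tridiagonal(lower, diag, upper, f)  # A w = f
    dot = lambda p, q: sum(pi * qi for pi, qi in zip(p, q))
    y = (g - dot(c, w)) / (d - dot(c, v))         # scalar Schur complement
    x = [wi - vi * y for wi, vi in zip(w, v)]
    return x, y
```

Only two solves with the band matrix A are needed, plus a few inner products, so the border adds little work beyond the band solver itself. The sketch assumes that A is nonsingular and well conditioned.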

"The Impact of the IBM 3090 Vector Facility on the
Data Analysis for JET, The Major European Nuclear
Fusion Project"

R.T. Ross et al., JET Joint Undertaking and IBM UK.

JET, the Joint European Torus, is the largest single
project of the nuclear fusion research program of EURATOM.
The objective is the study of plasmas in conditions
approaching those required for a fusion reactor. The experiment
is run in pulsed mode every 15 minutes and each
pulse yields some 12 Mbytes of experimental data. These
data are sent to the recently installed IBM 3090-200E with
Vector Facility (VF) and a series of data analysis, database
storage, and display programs are run in order to
give a detailed analysis of the pulse within the 15-minute
interpulse period to provide essential information for the
subsequent pulse.

This  paper  shows  where  the  use  of  the  VF  has 
made  a  significant  impact  on  the  level  of  analysis 
possible  within  the  interpulse  period.  One  program, 
IDENTC, which is particularly CPU-intensive, is essential
for an accurate determination of the plasma boundary
and the internal geometry of the plasma. It achieves
this by solving the Grad-Shafranov equation using finite
element  techniques.  The  various  methods  used  to  op¬ 
timize  this  program  to  take  full  advantage  of  the  VF  are 
discussed. 

These  include: 

• FORTRAN code restructuring to make maximum use
of the multiply/add compound vector operation

• The writing of a key subroutine in IBM Vector Assembler
language.

"Design Aspects of a Linear Algebra Package for the
SUPRENUM Machines"

W. Ronsch and H. Strauss, Professor Dr. Feilmeier, Junker & Co., GmbH,
West Germany.

The  SUPRENUM  multiprocessor  computer  sys¬ 
tem  is  designed  to  be  a  distributed  memory  message 
passing  (dmmp-)  machine  expected  to  reach  a  peak 
performance  of  nearly  4  Gigaflop/s.  Each  of  the  256  pro¬ 
cessors  has  a  scalar  as  well  as  a  vector  unit  and  can  be  con¬ 
trolled  by  its  own  program,  although  many  applications 
will  use  the  mode  of  single  program  multiple  data  pro¬ 
graming. 

The development of a linear algebra package for these
machines has to take into account several aspects, among
them:

•  The  communication  costs  -  i.e.,  the  ratio  of  computa¬ 
tion  speed  to  communication  speed 

•  The  mapping  strategy  for  the  data  objects 

•  The  mapping  consistency  of  the  data  objects 
throughout  the  whole  package 

•  The  load  balancing  of  the  processors  used  in  the  com¬ 
putation 


•  The  numerical  stability  of  the  algorithms  chosen 

•  The  overall  performance  of  the  package  -  i.e.,  the  ef¬ 
ficiency  of  each  routine. 

In  this  paper  we  present  our  methodology  for  data 
partitioning,  the  storage  scheme  chosen  for  the  data  ob¬ 
jects  (matrices,  vectors,  and  scalars),  and  the  scope  of  the 
linear  algebra  package.  It  is  our  plan  to  hide  the  com¬ 
munication  from  the  user  and  to  give  the  parallel  routines 
a  calling  sequence  which  is,  as  far  as  possible,  in  accord¬ 
ance  with  the  corresponding  LINPACK  and  BLAS  rou¬ 
tines. 

In  designing  the  package  some  emphasis  was  also  laid 
on the easy portability of the package to other dmmp-machines
such as the Intel iPSC and the Floating Point T-Series.

The  package  is  being  implemented  in  SUPRENUM 
FORTRAN  which  is  an  extension  of  FORTRAN  77  with 
some 8X features and communication instructions for
message  passing. 

Some  first  results  from  our  experience  of  the  SU¬ 
PRENUM  preprototype  are  given. 

"Implementation of Pre-Stack Depth Migration on
IBM 3090"

Ottar A. Sandein and Torstein Egilson, Norway.

In  seismic  processing  it  is  of  utmost  importance  to 
achieve  optimum  spatial  resolution  of  the  subsurface.  In 
particular, this is important in complex geological structures
where the potential reservoirs are most likely to be
located. Seismic migration is the processor that over the
last years has proven most successful in producing an unbiased
image of the subsurface. However, the full potential
of the migration process is not exploited in
routine seismic processing, as the migration is usually
done at the end of the processing sequence.

Hence  we  want  to  demonstrate  a  migration  scheme 
that  is  done  pre-stack  and  on  a  shot  level  in  order  to 
remove  the  smoothing  effects  introduced  by  previous 
processors.  Pre-stack  migration  on  single  shots  makes 
the  process  very  suitable  for  concurrent  processing  as  in¬ 
dividual  shots  can  be  processed  separately  on  each  cpu. 
The  three  main  features  of  the  process  can  be  sum¬ 
marized  as: 

1)  Forward  propagation  of  source  field  to  given 
depth  level 

2)  Backward  propagation  of  receiver  wave  field 
and  finally 

3)  The  imaging  process  at  the  same  level 

All  the  three  processes  can  be  shown  to  be  accom¬ 
plished  through  spatial  convolution,  which  makes  them 
very  attractive  for  vectorization. 

The same operations are repeated for every frequency,
which also introduces another level of concurrent
processing. Storage requirements, concurrent processing,
and vectorization will be demonstrated during the
presentation. Also, some numerical examples on realistic
synthetic data will be presented as a demonstration of
the imaging potential of the method.

"High  Resolution  Numerical  Simulations  of  Incom¬ 
pressible  Turbulent  Flows  on  the  IBM  3090  Vector 
Multiprocessor" 

Santangelo et al., IBM ECSEC, Italy, and B. Legras, Laboratoire de
Météorologie Dynamique, France

In recent years many numerical and theoretical
studies have been performed to understand the dynamical
properties of turbulent flows. The presently available
vector multiprocessors allow for a direct numerical integration
of the incompressible Navier-Stokes equations in
two dimensions and with very high resolutions (1024^2).
Besides theoretical motivations, 2-D turbulence is of interest
to geophysical fluid dynamics, as well as astrophysics
and plasma physics.

The  main  feature  of  2-D  turbulence  is  the  appear¬ 
ance  of  coherent  structures  inside  the  flow;  these  struc¬ 
tures  are  extremely  stable  and  their  motion  dominates  the 
flow,  suggesting  the  idea  that  only  a  few  degrees  of  free¬ 
dom  would  be  necessary  to  characterize  the  fully  turbu¬ 
lent  motion.  This  scenario  has  been  found  to  be 
consistent  with  the  results  of  many  numerical  simulations, 
independently  of  the  initial  conditions  and  both  for  de¬ 
caying  and  forced  turbulence. 

"Turbulent  Air  Flow  in  Disk  Files" 

M. Briscolini et al., IBM ECSEC, Italy.

In this paper we are concerned with numerical
simulation of the turbulent air flow inside
magnetic disk storage systems. Using the IBM 3090/VF,
we have investigated the flow pattern that arises between
and around the rotating disks at high speed in an
enclosed  space.  We  use  a  pseudospectral  formulation, 
and  in  considering  the  geometry  of  the  system  we  force 
directly  the  boundary  conditions  of  the  velocity  field. 
This  new  computational  technique,  introduced  by  Basde- 
vant  and  Sadourny,  was  later  extended  by  Briscolini  and 
Santangelo.  We  have  examined  this  problem  in  the  two- 
dimensional  approximation  and  in  the  three-dimensional 
case, and we present here a comparison between these
calculations.

"The IBM Parallel FORTRAN Language"

J.F. Shaw et al., IBM, US.

The  IBM  Parallel  FORTRAN  language  offers  fa¬ 
cilities  for  small-  through  large-grain  parallel  execution. 
It  provides  automatic  parallelization  and  vectorization  of 
loops,  explicit  language  for  specification  of  parallel  work, 
and permits multiple levels of parallel execution. The
language  is  designed  to  run  on  high-performance,  shared 
memory  multiprocessors,  where  dedicated  processors 
may  not  be  appropriate,  and  where  the  number  of  pro¬ 
cessors  may  change  from  moment  to  moment. 

This  paper  will  describe  the  design  of  the  language, 
discussing the model of programing it implements. The
language allows a programer to distribute parallel work
to processors, rather than have processors coordinate on
a single piece of work. The language stresses data integrity
and isolation by providing explicit dynamic control
over  the  sharing  and  copying  of  common  blocks.  Finally, 
Parallel  FORTRAN  is  intended  to  be  a  language  that  is 
independent  of  the  machine  configuration  and  the  oper¬ 
ating  system. 

"Digital  Reconstruction  of  Images  from  Their  Pro¬ 
jections  Using  a  Parallel  Computer" 

Fiorella Sgallari et al., Universita di Bologna, Universita di Firenze and
CINECA Computing Center, Italy.

The  problem  of  image  reconstruction  from  projec¬ 
tion  has  arisen  independently  in  a  large  number  of  scien¬ 
tific  fields.  Of  all  the  applications,  probably  the  greatest 
effect  on  the  world  at  large  has  been  in  the  area  of  diag¬ 
nostic  medicine:  computerized  tomography  (CT). 

In  the  last  decade  a  good  number  of  algorithms  and 
computational techniques dealing with the reconstruction
problem have been presented. The mathematical essence
of the procedure is the estimation and presentation
of a real-valued function of several variables from approximate
values of a finite number of its line integrals (the so-called
X-ray or Radon transform).

As far as the numerical reconstruction methods are
concerned, we have a considerable number of algorithms,
generally unrelated to each other. The reason perhaps
can be found in the intrinsic ill-posedness (in the sense of
Tikhonov) of the reconstruction problem: this fact makes
the analytical inversion formula for the
Radon transform practically useless.

We  believe  that  the  well-known  ART  and  SIRT  rec¬ 
onstruction  methods  present  some  remarkable  points  of 
interest.  From  a  purely  conceptual  point  of  view,  they  can 
be  placed  in  the  general  framework  of  the  regularization 
techniques  based  on  Tikhonov’s  "smoothing  functional;" 
of  course,  further  generalizations  of  the  existing  methods 
can be devised by choosing different "smoothing functionals."

Conversely, solution techniques based on relaxation
algorithms can be easily implemented on parallel
computers, exploiting their specialized arithmetic
units. It is worth noting that the convergence of such iterative
algorithms has been deeply investigated in the recent
literature, taking into account the relevant fact that
the updating process is in principle completely asynchronous
(some authors define it a "chaotic iteration process").

In  this  paper,  after  some  theoretical  remarks,  an 
example  of  "smoothing  functional"  is  provided.  The  con¬ 
vergence  of  the  related  parallel  iterative  procedure  is 
proved. The implementation on a CRAY X-MP/48 computer
is described; experimental results and comparisons
with synchronous algorithms are also reported.
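
For readers unfamiliar with ART, the basic row-action update can be sketched as below. This is the textbook sequential form, in which the current image is projected onto the hyperplane defined by one ray equation at a time; it is not the authors' regularized, asynchronous parallel procedure. The function name, the relaxation parameter `relax`, and the tiny 2 x 2 test image in the note that follows are assumptions for illustration.

```python
def art_reconstruct(rows, measurements, sweeps=200, relax=1.0):
    """Sequential ART (Kaczmarz) iteration for rows[i] . x = measurements[i].

    rows[i] holds the weights with which ray i crosses each pixel.
    """
    n = len(rows[0])
    x = [0.0] * n
    for _ in range(sweeps):
        for a, p in zip(rows, measurements):
            norm2 = sum(ai * ai for ai in a)
            if norm2 == 0.0:
                continue  # ray misses every pixel
            # Distribute the ray's residual back along the ray.
            residual = p - sum(ai * xi for ai, xi in zip(a, x))
            step = relax * residual / norm2
            x = [xi + step * ai for xi, ai in zip(x, a)]
    return x
```

With a 2 x 2 image whose pixels are (1, 2, 3, 4) and five rays (two row sums, two column sums, one diagonal), the ray equations determine the image uniquely and the iteration recovers it to within rounding error.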

"Trends in Supercomputing"

Karl-Einar Sjödin, Control Data A.B., Sweden.

The  ETA  10  supercomputer  family  includes  the  most 
powerful  supercomputers  of  today.  It  also  has  features 
and  functionality  that  can  be  extrapolated  into  industry 
trends. 

1.  Architecture  and  technology: 

• Structure of processors and memories

• Technology roadmaps for logic, memory, and communication/networking.

2. Software - operating system and programing issues:

• UNIX (or rather POSIX) is from now on the standard
for supercomputer operating systems

•  FORTRAN  will  continue  to  be  the  standard  program¬ 
ing  language  for  years  to  come 

•  Multitasking  using  parallel  processors  is  of  rapidly 
growing  importance. 

3.  Uses  of  supercomputers: 

•  Established  and  new  application  areas 

•  Departmental  supercomputing 

•  Supercomputers  and  workstations 

•  Simulations  and  visualizations 

Do we really need the teraflops/terawords supercomputer?

"Concurrent Dynamic Simulation of Distillation Columns
via Waveform Relaxation"

Anthony Skjellum and Manfred Morari, California Institute of Technology,
and Sven Mattisson, Lund Institute of Technology, Sweden.

The  distillation  column  and  its  variants  are  arguably 
the  most  important  class  of  unit  operations  in  chemical 
plants.  Dynamic  models  consist  of  differential-algebraic 
systems  with  stiff,  nonlinear  ordinary  differential  equa¬ 
tions  modeling  fluid  flow  and  mass  balances,  and  nonli¬ 
near  algebraic  equations  modeling  the  vapor-liquid 
equilibrium.  A  multicomponent  system  with  a  practical 
number  of  columns  can  have  on  the  order  of  1000  equa¬ 
tions  or  more.  Furthermore,  physical  property  calcula¬ 
tions  for  vapor-liquid  equilibrium  can  be  outstandingly 
arduous.  Hence,  the  dynamic  simulation  of  distillation 
columns  continues  to  receive  serious  attention  because 
detailed  models  often  have  huge  computing  require¬ 
ments.  It  is  therefore  natural  to  seek  concurrent  algo¬ 
rithms  suitable  for  large-scale  concurrent  computers  in 
order to obtain significant simulation speedup.

In this paper, we take the first steps toward reaching
this goal. We consider a highly simplified binary distillation
model with simple vapor-liquid equilibrium. A set of
stiff, coupled nonlinear ordinary differential equations
results. Importantly, these simplifications still allow us to
capture three central issues: stiffness, large problem size,
and computational difficulties arising from systems which
incorporate coupled columns. Waveform relaxation was
chosen  as  the  solution  approach  because  it  offers  large 
speedup  potential.  For  example,  temporal  latency  of  sub¬ 
systems  can  be  exploited,  implying  reduced  computa¬ 
tional  effort  for  such  subsystems  and,  correspondingly, 
the  opportunity  for  greater  overall  speedup  in  the  total 
simulation. 
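
Waveform relaxation can be sketched on a two-equation linear test problem; this is not the distillation model, and the system x' = -2x + y, y' = x - 2y with x(0) = 1, y(0) = 0, the backward Euler discretization, and the function names are all assumptions for illustration. In each iteration every subsystem is integrated over the whole time window on its own, using the other subsystem's waveform from the previous iteration, so the two integrations are independent and could run on separate processors.

```python
def waveform_relaxation(h=0.01, steps=100, iterations=40):
    """Gauss-Jacobi waveform relaxation with backward Euler per subsystem."""
    x = [1.0] * (steps + 1)  # initial guess: constant waveforms
    y = [0.0] * (steps + 1)
    for _ in range(iterations):
        x_new = [1.0]
        y_new = [0.0]
        for k in range(steps):
            # Each update uses only the OLD waveform of the other variable.
            x_new.append((x_new[k] + h * y[k + 1]) / (1.0 + 2.0 * h))
            y_new.append((y_new[k] + h * x[k + 1]) / (1.0 + 2.0 * h))
        x, y = x_new, y_new
    return x, y

def coupled_backward_euler(h=0.01, steps=100):
    """Reference: backward Euler applied to the coupled system directly."""
    x, y = [1.0], [0.0]
    det = (1.0 + 2.0 * h) ** 2 - h * h
    for k in range(steps):
        x0, y0 = x[k], y[k]
        x.append(((1.0 + 2.0 * h) * x0 + h * y0) / det)
        y.append((h * x0 + (1.0 + 2.0 * h) * y0) / det)
    return x, y
```

On this test problem the relaxed waveforms converge to the coupled backward Euler solution, with the error at least halving in every iteration.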

Because of the success of Mattisson’s CONCISE
simulator for VLSI circuit simulation via waveform relaxation,
we have formulated the distillation model within
CONCISE. We present preliminary speedup results obtained
with CONCISE on a binary n-cube computer system.
We also indicate how the use of more demanding
models (including, for example, complex vapor-liquid
equilibrium) is likely to change the observed concurrent
efficiency. Future directions of this research are also indicated.

"A Formal Model and an Empirical Metric for Memory
Latency in Multiprocessors"

David F. Snelling, University of Leicester, UK.

All  multiprocessors,  particularly  shared  memory 
multiprocessors,  are  affected  by  memory  latency. 
Usually  there  is  some  performance  loss  as  memory  paths 
become longer, or bank conflicts and synchronization delays
increase.

The classical techniques of compensating for memory
latency, such as vectorization and instruction pipelining,
are effective in serial processing environments, but do
little to alleviate the performance degradation due to
multiprocessing. In this paper the fundamental aspects
of memory latency are identified and outlined, and a
generalized, formal model for memory latency is proposed.

This  formal  model  is  compared  to  an  empirical  me¬ 
tric  for  determining  the  actual  effect  of  memory  latency 
on  a  particular  multiprocessor  running  a  given  algorithm. 
This  memory  latency  metric  is  a  measurable  quantity  that 
indicates  the  extent  to  which  a  particular  multiprocessor 
architecture  can  hide  or  mask  the  memory  and  synchroni¬ 
zation  latency  inherent  in  a  given  algorithm. 

The  ability  of  several  multiprocessors  to  mask  latency 
is analyzed using these two concepts. The Denelcor HEP
and  the  transputer-based,  shared  memory  multiproces¬ 
sor  being  developed  at  the  University  of  Leicester  will 
serve  as  case  studies.  These  two  machines  are  known  to 
have  the  ability  to  hide  memory  and  synchronization 
latency  and  are  therefore  highly  suitable  as  case  studies. 
Nonlatency-hiding  machines  will  also  be  discussed. 

"Implementing  and  Tuning  Multigrid  on  Local  Mem¬ 
ory  Parallel  Computers" 

Karl Solchenbach, SUPRENUM GmbH, West Germany.

Multigrid (MG) methods are often claimed to be not
suitable for highly or massively parallel multiprocessor
systems with many processors because of the "lack of parallelism"
on the coarse grids. Indeed, if implemented in
a straightforward manner the multiprocessor-efficiency
of MG methods on local memory machines may be significantly
lower compared to single grid methods. The degree
of efficiency reduction is determined on one hand by
the system characteristics (e.g., time for communication)
and on the other hand by the implementation of the
coarse grid data structure on the multiprocessor system.

The purpose of this paper is first to investigate several
possible coarse grid techniques and second to show
how the efficiency of a given MG algorithm depends on
critical system parameters. As a result of both considerations
we expect some knowledge on how to tune parallel
MG methods optimally to a given multiprocessor
system by proper choice of the coarse grid techniques.
The  basic  requirements  for  an  implementation  of  MG  on 
parallel  systems  are  briefly  outlined,  namely, 

•  MG  components  with  a  high  degree  of  inherent  paral¬ 
lelism  (like  red-black  relaxation) 

•  Data  distribution  concept  that  ensures  optimal  load 
balancing and minimal communication (grid partitioning).
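
The inherent parallelism of red-black relaxation, named in the first requirement above, can be seen in a short sketch. The example assumes the five-point discretization of the Poisson equation on a square grid with zero boundary values; the function names and the serial loops are illustrative and are not taken from the paper. Every "red" point (i + j even) has only "black" neighbors and vice versa, so all points of one color can be updated at the same time, and under grid partitioning each processor needs only one exchange of boundary values per color.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for -Laplace(u) = f (five-point stencil).

    u and f are (n+2) x (n+2) lists of lists; the outer ring of u holds
    the boundary values and is never modified.
    """
    n = len(u) - 2
    for color in (0, 1):
        # All points of one color depend only on points of the other color,
        # so this double loop could be executed fully in parallel.
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == color:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      + h * h * f[i][j])

def residual_norm(u, f, h):
    """Maximum absolute residual of the discrete equations."""
    n = len(u) - 2
    worst = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            lap = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                   - u[i][j - 1] - u[i][j + 1]) / (h * h)
            worst = max(worst, abs(f[i][j] - lap))
    return worst
```

Used on its own, as here, the relaxation converges slowly on fine grids; in a multigrid method it serves only as the smoother on each grid level.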

In  order  to  have  a  flexible  tool  which  allows  perfor¬ 
mance  estimates  of  different  coarse  grid  techniques  and 
different  multiprocessor  systems,  an  MG  simulation  pro¬ 
gram  was  developed.  Its  underlying  assumptions,  its  cal¬ 
culation  and  communication  model,  and  its  facilities  are 
described.  The  aim  is  to  evaluate  different  techniques  for 
the  treatment  of  the  coarse  grids  on  local  memory  ma¬ 
chines,  to  analyze  the  dependence  on  system  parameters, 
and  to  predict  the  behavior  of  these  techniques  on  paral¬ 
lel  machines  which  are  not  available  yet  (like  Suprenum). 
Different  possibilities  of  accelerating  the  coarse  grid  cal¬ 
culations  on  local  memory  machines  are: 

•  Stop  the  coarsening  process  and  perform  relaxations 
on  some  intermediate  level 

•  Collect  the  coarse  grid  data  to  fewer  processors 

•  Use  larger  overlap  areas  in  order  to  reduce  the  num¬ 
ber  of  messages. 

A  detailed  analysis  of  the  calculation  and  communi¬ 
cation  work  related  to  these  techniques  is  given.  Finally, 
results  of  the  simulation  are  presented.  The  estimated 
data  are  compared  to  measurements  on  existing  local 
memory  machines.  Performance  estimates  of  the  differ¬ 
ent  coarse  grid  techniques  are  given.  The  multiproces¬ 
sor-efficiency  of  the  overall  MG  method  is  estimated  and 
requirements on the multiprocessor system characteristics
are derived.

"Efficient Parallel Implementable Algorithms for
Determination of Line-of-Sight Visibility"

Arun K. Somani, University of Washington, Seattle, and Farid Mamaghani,
BBN Delta Graphics, Inc., US.

This  paper  presents  two  efficient  and  fast  algorithms 
which vectorize the task of finding the visible points
within a square of side 2n (or a circle of radius n) on a
plan view display of terrain maps from any given viewpoint.
These algorithms can be implemented either on a
sequential computer or on special-purpose parallel hardware.

Inadequacies and inefficiencies of some of the techniques
currently used to achieve this are discussed. The
details of a pipelined parallel architecture for implementation
of the new algorithms are also presented.

“Divide  and  Conquer  Algorithms  for  SIMD 
Architectures” 

Lennart Johnsson, Thinking Machines Corporation, US, and D.C. Sorensen,
Argonne National Laboratory, US.

Divide  and  conquer  algorithms  for  the  symmetric  al¬ 
gebraic  eigenvalue  problem  have  been  successful  on  both 
shared  memory  and  distributed  memory  architectures. 

We  investigate  the  suitability  of  this  scheme  for 
SIMD  architectures  such  as  the  Connection  Machines.  A 
version  of  the  algorithm  that  is  suitable  for  computing 
eigenvalues of a symmetric tridiagonal matrix will be
presented. 

This  algorithm  has  potential  for  utilizing  all  proces¬ 
sors  of  a  large  SIMD  array  throughout  the  course  of  find¬ 
ing  all  the  eigenvalues  of  the  given  matrix. 
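
The "divide" step of such a scheme can be sketched as follows. This is
the standard rank-one tearing used by divide-and-conquer eigensolvers;
the abstract does not say which variant the authors use, so treat it as
an assumption. The helper names `tear`, `dense`, and `reassemble` are
hypothetical. A symmetric tridiagonal matrix T is written as a block
diagonal matrix of two smaller tridiagonal matrices plus a rank-one
correction rho * v v^T, so the two halves can be processed independently.

```python
def tear(diag, off, k):
    """Split tridiagonal T between rows k-1 and k (0-based).

    T = blockdiag(T1, T2) + rho * v v^T, with v = e_{k-1} + e_k.
    diag has n entries, off has n-1 entries (off[i] couples i and i+1).
    """
    rho = off[k - 1]
    d1, d2 = list(diag[:k]), list(diag[k:])
    d1[-1] -= rho  # compensate for the rank-one term on the diagonal
    d2[0] -= rho
    return (d1, list(off[:k - 1])), (d2, list(off[k:])), rho


def dense(diag, off):
    """Build the dense symmetric tridiagonal matrix as nested lists."""
    n = len(diag)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        t[i][i] = diag[i]
    for i in range(n - 1):
        t[i][i + 1] = t[i + 1][i] = off[i]
    return t


def reassemble(left, right, rho):
    """Recombine the two blocks and the rank-one term into dense T."""
    t1, t2 = dense(*left), dense(*right)
    k, n = len(t1), len(t1) + len(t2)
    t = [[0.0] * n for _ in range(n)]
    for i in range(k):
        t[i][:k] = t1[i]
    for i in range(n - k):
        t[k + i][k:] = t2[i]
    for i in (k - 1, k):
        for j in (k - 1, k):
            t[i][j] += rho  # rho * v v^T touches only a 2x2 block
    return t
```

Applying `tear` recursively yields many small independent subproblems,
which is one reason the approach is a natural candidate for machines
with many processors.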

"Parallel  Multigrid  for  Solving  the  Steady-State,  In¬ 
compressible  Navier-Stokes  Equations  on  General 
2-D  Domains" 

Dr. Klaus Stüben, Gesellschaft für Mathematik und Datenverarbeitung
mbH, West Germany.

Multigrid  methods  for  solving  elliptic  partial  dif¬ 
ferential  equations  are  known  to  be  optimally  efficient  on 
sequential  computers  —  i.e.,  the  computational  work  re¬ 
quired  to  solve  the  corresponding  discrete  system  of  equ¬ 
ations  is  proportional  to  the  number  of  unknowns.  The 
additional  fact  that  all  components  of  a  multigrid  method 
are fully parallelizable makes such methods highly attractive
for parallel computers.
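
The linear work estimate can be made plausible with a short calculation.
It is a sketch under assumptions not stated in the abstract: a 2-D
problem, standard coarsening in which each coarser grid has about one
quarter of the unknowns, one V-cycle, and work on each level bounded by
a constant C times the number of unknowns on that level. With N unknowns
on the finest grid and L coarser levels:

```latex
W_{\text{cycle}}
  \;\le\; \sum_{l=0}^{L} C\,\frac{N}{4^{l}}
  \;<\; C\,N \sum_{l=0}^{\infty} 4^{-l}
  \;=\; \frac{4}{3}\,C\,N .
```

Together with a convergence rate that does not depend on the mesh size,
this gives total work proportional to N.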

A  parallel  (MIMD,  local  memory)  multigrid  method 
for  solving  the  steady-state  Navier-Stokes  equations  on 
general  2D-domains  is  presented,  based  on  a  finite-vol¬ 
ume  discretization  for  general  (nonorthogonal)  meshes. 
The  method  served  as  the  basis  of  a  program  package  for 
the  parallel  solution  of  general  2-D  flows.  Details  on  the 
multigrid algorithm as well as on the general program design
will be given.

Although  the  program  is  actually  being  developed  in 
the  German  supercomputer  project  SUPRENUM,  high 
priority  has  been  given  to  the  requirement  of  portability: 
The  program  is  designed  so  that  it  can  be  used  both  on 
sequential  and  parallel  computers.  This  is  achieved  by 
means  of  a  flexible  and  general  library  of  "communication 
routines";  all  machine-dependent  constructs  are  com¬ 
pletely  hidden  inside  these  routines,  which  perform  the 
mapping  of  processing  as  well  as  the  communication. 

"Lattice Gas Simulations of Two-Dimensional Turbulence
on IBM 3090/VF"

S. Succi and P. Santangelo, IBM/ECSEC, Italy.

Lattice Gas (LG) models obeying cellular automata
(CA) rules have recently gained a great deal of interest as
a new alternative tool for the simulation of complex
hydrodynamic phenomena. In fact, despite the fully discrete
nature of their underlying phase space, LG do
exhibit some physical features like sound waves, viscous
behavior, and hydrodynamic instabilities that are typical
of continuous systems such as fluids. From the computational
point of view, the essential property of CA is that
they work with quantized variables, i.e., variables
which take values only in a small set of integers, typically
just zero or one, and are consequently represented exactly
with only a few bits per dynamic variable in
computer storage.
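
A minimal sketch of what one bit per dynamic variable makes possible is
given below. It uses the simpler square-lattice HPP collision rule rather
than the hexagonal model run by the authors, and the packing layout
(`WIDTH` sites per integer, one integer per velocity channel) is an
assumption for illustration only. A head-on pair at an otherwise empty
site is rotated by 90 degrees, and a handful of bitwise operations
updates all packed sites at once.

```python
WIDTH = 16                 # number of lattice sites packed into one integer
MASK = (1 << WIDTH) - 1


def hpp_collide(e, w, n, s):
    """HPP collision on WIDTH sites at once.

    Each argument packs one velocity channel (east, west, north, south),
    one bit per site. A site holding exactly an east-west pair becomes a
    north-south pair, and vice versa; all other sites are unchanged.
    """
    flip = (e & w & ~n & ~s) | (n & s & ~e & ~w)
    flip &= MASK
    return e ^ flip, w ^ flip, n ^ flip, s ^ flip


def mass(*channels):
    """Total particle count over all channels."""
    return sum(bin(c).count("1") for c in channels)


before = (0b0001, 0b0001, 0b1000, 0b1000)
after = hpp_collide(*before)
print(after)                        # (8, 8, 1, 1)
print(mass(*before), mass(*after))  # 4 4
```

Because the update is pure word-wide logic with no branches, long arrays
of such words map directly onto vector instructions.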

In this paper, we discuss the few and simple criteria
which govern the efficient implementation of CA rules on
the IBM 3090 vector multiprocessor. Performance data
are also presented and commented on. As a physical
application, we present some results concerning the free
decay of homogeneous turbulence in two dimensions, obtained
by running the hexagonal Frisch-Hasslacher-Pomeau
lattice gas on a very-high-resolution grid containing
8192 x 8192 nodes. One of the main results emerging from
these high-resolution runs is the neat detection of the
long-range forces governing the interaction between the
coherent structures (vortices) present in the fluid.

"Numerical  Software  Development  for  Local  Mem¬ 
ory  Machines" 

Bernhard Thomas, SUPRENUM GmbH, West Germany.

Scientific supercomputing on large-scale parallel
machines encounters essentially new problems, as compared
to vector supercomputers, both with respect to
development and performance of programs. As for program
development, handwriting efficient parallel code
anew still takes some effort, whereas automatic parallelization
of existing code is only just beginning. As for
performance, well-established sequential algorithms may
not be parallelizable, or not efficiently so, and new parallel
algorithms have to deal with a possible lack of stability.
However, there are problem areas and machine architectures
where some of these problems can be substantially
relaxed.

In the SUPRENUM project, a parallel supercomputer
is being developed to deliver performance in the range of
several Gflops for scientific computing. Basic components
of the local memory architecture are powerful vector
processing nodes, connected by a very fast
hierarchical bus system. Concurrently with the system
development, software developments in SUPRENUM
attack the above problems on different levels:

•  Parallel numerical software is developed that optimally
exploits the system's capacities, e.g., basic routines
in linear algebra, multigrid PDE-solvers, grid
generation, and CFD packages.

•  Libraries are provided that facilitate writing of portable
and efficient parallel programs by means of standardized,
optimal routines for typical tasks arising with
message passing systems, e.g., generation of and
communication within a process system, and mapping to the
target hardware.

•  Tools  are  developed  for  interactive  parallelization  of 
existing  code  and  for  the  support  of  generating  new 
parallel  applications  programs. 

This  paper  gives  details  about  these  strategies,  with 
an  emphasis  on  software  developments  for  grid-based 
problems in conjunction with the message passing library.

"Parallel  Processing  Within  a  Virtual  Machine" 

L.J. Toomey et al., IBM Kingston, US.

The ability of single scientific applications to utilize
the full computational capability of a tightly coupled,
shared memory, multiprocessor system is becoming
increasingly important. To date, applications run with
CMS in a virtual machine have been restricted to a
uniprocessor view. This paper describes a method of enabling
numerically intensive applications to simultaneously
use all processors of a multiprocessor configuration within
a single virtual machine.

Virtual CPUs, a feature of VM/XA SF and VM/XA
SP1, are used to specify dispatchable units of work to the
VM control program. A CMS server is described that
allows all virtual CPUs to obtain CMS services. The
methods discussed have been implemented for FORTRAN
applications. However, these methods are applicable
to other languages as well. The implementation is
the basis for the IBM Parallel FORTRAN PRPQ on
VM/XA SP1.

"Implementational  Aspects  of  ADA  for  Vector  Pro¬ 
cessing  Target  Machines" 

Peter Wehrum et al., Siemens AG and University of Marburg, West
Germany.

The purpose of this paper is to describe some aspects
of parts of the MAF project 786, "Parallel Numerical
Processing through Ada" (PNPA). This project has been
carried out by Siemens AG together with the University of
Marburg. The aim is to examine the programing language
Ada for its capability in the field of scientific parallel
programing; i.e., to look at the problem of how to
write efficient numerical applications for supercomputers
in Ada, and how to translate them into highly parallel
machine code. This part of the work is centered on
vector processors and vectorization techniques.

Ada may be regarded as a programing language
appropriate for the writing of scientific and, in particular,
parallel applications. First of all, Ada supports software
engineering concepts such as programing in the large,
software reuse, reliability, and readability. Language
features supporting these concepts are, inter alia, separate
compilation, generic subprograms and packages, exception
handling, overloading, and operator definition.

Regarding  the  granularity  of  concurrent  language 
constructs, parallelism may occur on at least three different
levels:

•  Expression  level 

•  Statement  level 

•  Module  level. 

The  latter  is  supported  by  Ada  through  tasking;  the 
former  two  levels,  however,  are  not  directly  covered  by 
Ada  language  features.  As  for  parallelism  at  the  express¬ 
ion  level,  Ada  provides  only  a  few  predefined  operations 
for arrays but does not supply predefined arithmetic vector
operations.

Basically, there are two approaches which make it
possible to exploit the facilities offered by a vector
processor:

First,  an  auto-vectorizing  compiler  is  built  and 
utilized  for  the  translation  of  scientific  programs,  which 
may  be  written  in  sequential,  scalar  Ada.  The  algorithms 
in this kind of compiler may be based on vectorizing techniques
as described, for example, by Kennedy et al., Kuck
et al., and Lamport.

Second,  a  standard  generic  package  is  provided  that 
specifies vector and matrix operations. Instantiations of
this  vector  package  yield  the  appropriate  operations  in 
high-level  vector  notation,  on  which  scientific  applica¬ 
tions  can  then  be  based.  The  bodies  of  the  vector  oper¬ 
ations  may  be  implemented  by  means  of  low-level 
features,  such  as  machine  code  insertions  and  interfaced 
Assembler  and  FORTRAN  subprograms. 

In  the  PNPA  project,  both  approaches  have  been  fol¬ 
lowed  in  that  a  vector  package  has  been  specified  (which 
is  simpler  and  less  expensive  to  implement),  and  the  re¬ 
quirements  of  an  auto-vectorizing  compiler  have  been 
studied. 

Additional requirements discussed within the project
concern the environment in which computer-aided
software engineering (CASE) for vectorizing Ada programs
proceeds, including an interactive vectorizing optimizer
for preprocessing Ada code, and further APSE
elements such as a debugging tool and a vectorized
numeric library.

The  interactive  vectorizing  optimizer  performs  a 
code-to-code  transformation  at  source  level  supporting 
vectorization  in  order  to  ease  the  vector  analysis  process 
of  later  compilations.  This  tool  has  been  specified  by 
graph-related  techniques  as  a  knowledge-based  system 
including  a  learning  facility. 

The  environment  is  based  on  a  workstation  network 
connected  to  a  scalar  front-end  and  the  vector  back-end 
processor.  The  vector  target  machine  primarily  concen¬ 
trated  on  is  the  Siemens  7.800  VP  200  /  Fujitsu  VP  200. 


"Parallel  Algorithms  for  Some  Reservoir  Engineer¬ 
ing  Problems" 

Mary Wheeler, Rice University/University of Houston, US.

Parallel  algorithms  based  on  domain  decomposition 
and  operator  splitting  are  presented  for  some  problems 
in  reservoir  engineering.  Theoretical  and  computational 
results  are  both  discussed. 

“An  Additive  Variant  of  the  Schwarz  Alternating 
Method  for  the  Case  of  Many  Subregions" 

Olof Widlund, New York University.

In  recent  years,  there  has  been  a  revival  of  interest  in 
the  Schwarz  alternating  method.  Other  domain  decom¬ 
position  algorithms,  in  particular,  so-called  iterative  sub¬ 
structuring  methods,  have  also  been  developed  to  solve 
elliptic  finite  element  problems. 

In  this  paper,  we  present  an  additive  variant  of  the 
Schwarz  method  for  elliptic  problems,  which  shows  great 
promise  for  parallel  computers.  By  using  ideas  previously 
proven  very  useful  for  iterative  substructuring  methods, 
we are able to design a method which converges at a
rate that is independent of the mesh size as well as the
number of subregions.

We note that, as is the case with other domain decomposition
methods, a mechanism for global transportation
of  information  is  needed  in  order  to  obtain  fast  conver¬ 
gence  if  there  are  many  subregions.  In  our  algorithm  this 
is  accomplished  by  solving  a  global  coarse  problem  in  ad¬ 
dition  to  the  local  problems  typically  associated  with 
Schwarz’  algorithm. 

"Spectral  Decomposition  Methods  for  the  Numeri¬ 
cal  Solution  of  Partial  Differential  Equations  Using 
Vector  and  Parallel  Processors" 

David M. Young, the University of Texas, US.

The  paper  is  concerned  with  the  numerical  solution 
of  partial  differential  equations  based  on  finite  difference 
methods or on finite element methods. Iterative methods
are used to solve the large, sparse linear systems of equations
which arise. The emphasis is on the use of algorithms
which are suitable for use on vector and parallel
processors.

The focus is on methods which are based, not upon
the decomposition of the physical domain, but rather on
the decomposition of the eigenvalue spectrum of the coefficient
matrix A or of a preconditioned matrix associated with
some basic iterative method such as the Jacobi method or
the  symmetric  SOR  method.  These  spectral  decomposi¬ 
tion  methods  are  related  to  multigrid  methods  as  well  as 
to  additive  correction  methods  used  by  Adams  (1985)  and 
to  multisplitting  methods  of  O’Leary  and  White  (1985). 

They  involve  first  splitting  the  initial  residuals  corre¬ 
sponding  to  several  initial  approximate  solutions  into  sev¬ 
eral  components  each  of  which  is  primarily  associated 
with  a  part  of  the  eigenvalue  spectrum.  Each  such  com¬ 
ponent  can  then  be  treated  independently  in  parallel  and 
the  results  combined  to  minimize  the  norm  of  the  error. 
The process can be repeated. Numerical results illustrating
the use of the procedure are given.

"Vector  and  Parallel  Computing  for  Nonlinear  Net¬ 
work  Optimization" 

Stavros A. Zenios, University of Pennsylvania, Philadelphia.

Nonlinear  optimization  models  with  network  con¬ 
straints  are  recurrent  in  applications  in  transportation, 
economics, statistics, operations research, and so on. We
will discuss the design and implementation of parallel
algorithms for this problem class.

First, a primal truncated Newton algorithm has been
streamlined for the environment of the CRAY X-MP/48
using both vectorization and microtasking. The resultant
algorithm is improved in performance by a factor of five
over the scalar code.

Second, we design and implement a relaxation algorithm
on a massively parallel Connection Machine,
CM-1. One of the largest problem instances is solved in
less than two seconds on the CM-1. The same problem
takes 1.5 CPU minutes on a vector supercomputer and
many hours (>8) on a minicomputer.

The algorithms will be introduced, followed by a discussion
of the parallel implementations on the CM-1 and
the CRAY X-MP/48. We will also present computational
results  with  applications  data  and  randomly  generated 
problems. 

"An  Advanced  Programing  Environment  for  a 
Supercomputer" 

Hans P. Zima, Bonn University, West Germany.

This paper discusses the programing environment
that is currently being developed for the German
supercomputer SUPRENUM. The SUPRENUM system is a
loosely-coupled, massively parallel multiprocessing system
whose overall design is oriented towards the solution
of large-scale problems in numerical simulation and
scientific supercomputing, with a particular emphasis on
the multigrid method for the solution of large systems of
partial differential equations.

The paper concentrates on two tools of the programing
environment that are specifically tailored to support
the kernel class of applications on a high level of abstraction,
and at the same time achieve a high degree of run-time
efficiency. These tools are the semiautomatic
parallelization system SUPERB and the very high-level
specification system SUSPENSE.

The objective of SUPERB is not only the parallelization
of "dusty-deck" programs by transforming sequential
FORTRAN 77 programs into parallel programs in
SUPRENUM FORTRAN (a parallel superset of FORTRAN 77
supporting the features of the SUPRENUM
architecture); in addition, it will play a crucial role
in the major program development paradigm of SUPRENUM,
which specifies a multistage process that starts with
the writing of a sequential program and proceeds to the
final parallel SUPRENUM FORTRAN program in a
number of steps that are associated with the step-by-step
validation and verification of various properties of the
program.

The  specification  system  SUSPENSE  enables  pro¬ 
graming  on  a  very  high,  application-oriented  level;  it 
allows  the  formulation  of  partial  differential  equations 
and  their  solution  methods  in  the  language  of  numerical 
mathematics.  The  largely  automatic  process  of  transfor¬ 
ming  specifications  into  SUPRENUM  FORTRAN  pro¬ 
grams  is  being  supported  by  a  knowledge  base  that  is  part 
of the system.

Student  Scholarship  Winners 

"Efficient  Parallel  Programs  Through  Pipelined 
Block  Algorithms,  the  QR  Decomposition  as  an 
Example" 

Christian Bischof, Cornell University, New York.

Block algorithms are effective for obtaining efficient
and portable algorithms on computers with special vector
processing hardware. We show that these advantages
extend to parallel architectures. Combining block-oriented
algorithms and pipelining, we obtain programs
that have good processor utilization and load balancing
properties and are easy to port across different architectures.
We illustrate this technique with an algorithm for
computing the QR factorization.

"An  OR-Parallel  Execution  Model  for  Full  Prolog" 

Johannes Engels, Institut für Informatik, Abteilung III, Rheinische
Friedrich-Wilhelms-Universität, Bonn, West Germany.

This  paper  describes  an  OR-parallel  execution 
model  for  Prolog.  The  goal  is  to  execute  full  Prolog  rather 
than  a  subset  of  Prolog  in  a  fast  way.  Space  optimization 
is not considered.

We want to break with the prejudice that "Many of
the extensions of logic programing included in most Prolog
systems are meaningful only in single-processor,
sequential systems. Most notable are assert and retract...
and the cut operation..."

"Parallel  Compact  Symmetric  FFTs" 

Van Emden Henson, University of Colorado, Boulder.

Since its introduction in 1965, the Fast Fourier Transform
has become one of the most widely used algorithms
of computational mathematics. Modern signal processing,
spectroscopy, image processing, and techniques for
solving partial differential equations all depend heavily
on the FFT. Its enormous popularity is due largely to the
fact that the FFT requires $O(N \log N)$ arithmetic operations
to compute the transform of a complex vector of
length N, instead of the $O(N^2)$ operations required to
compute the transform as a matrix-vector product. The
description "compact symmetric" refers to a family of FFT
algorithms that uses minimal storage and arithmetic for
data sequences possessing certain symmetries.

The first such algorithm, generally attributed to
Edson, computes the transform of a real vector using half
the storage and half the operations used by the original
FFT. It has long been known that further savings are
possible when the data has additional symmetries, but
with the exception of one little-publicized algorithm by
Gentleman, such transforms were performed by pre- and
post-processing of data for use with conventional FFTs,
which resulted in marginal savings.

In recent papers by Swarztrauber and Briggs, compact
algorithms are developed for sequences with real,
even, odd, quarter wave even, and quarter wave odd
symmetries. These algorithms, like the original Cooley-Tukey
FFT, work because the transform of a sequence
can be formed by combining the transforms of two
half-length sequences with very little work. By successively
splitting the original N-vector, eventually N sequences of
length one are produced, which are their own transforms.
These are combined to form the transforms of sequences
of length two, which in turn are combined to form the
transforms of sequences of length four. After log N passes
through the data the transform of the original vector is
produced. The new compact FFTs depend on the fact
that when symmetric sequences are split the resulting
subsequences inherit symmetries. (For example, when an
even sequence is split, one of the resulting sequences is
even, while the other is quarter wave even.) Performing
the FFT then consists of modifying the Cooley-Tukey
"combine" formulae to account for the changing symmetries
of the partial subsequences. The transforms can
then be performed in-place (without using an extra storage
array). Implementing an algorithm requires a data
structure that exploits the compactness allowed by the
symmetry.
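
The split-and-combine structure described above can be sketched with the
plain radix-2 Cooley-Tukey recursion. This is the conventional complex
FFT, not one of the compact symmetric variants, and it assumes the input
length is a power of two; the function names `fft` and `dft` are
illustrative. The direct transform `dft` is included only as a reference
to check against.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)  # a length-one sequence is its own transform
    even = fft(x[0::2])  # transform of the even-indexed half
    odd = fft(x[1::2])   # transform of the odd-indexed half
    out = [0j] * n
    for k in range(n // 2):
        # "Combine" step: one twiddle multiply and two additions per pair.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Direct O(N^2) transform, written as a matrix-vector product."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

For real input the output satisfies X[N-k] = conj(X[k]), so half of it is
redundant; this is the redundancy that the real-data algorithm mentioned
above removes, and the further symmetries are handled in the same spirit.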

This  paper  shows  how  the  compact  symmetric  algo¬ 
rithms  can  be  implemented  on  parallel  processing  ma¬ 
chines.  Serial  codes  for  the  compact  symmetric  FFT’s, 
not  previously  available,  are  implemented  as  a  prelude  to 
parallelizing  the  algorithms. 

"Vectorizing  the  Multiple-Shooting  Method  for  the 
Solution  of  Boundary-Value  Problems  and  Optimal- 
Control  Problems" 

Martin Kiehl, Munich University of Technology, West Germany.

Many optimization problems in science and engineering
can often be described as optimal-control problems,
such as the control of the flight of an aircraft, the
movement  of  a  robot,  or  a  chemical  reaction.  These  prob¬ 
lems  become  more  complicated,  since  the  solution  has  to 
consider  constraints,  so  that,  for  example,  technical  and 
security  limits  or  given  tolerances  are  satisfied.  This  is  the 
general  case  in  realistic  models. 

Many of these problems can successfully be solved by
the multiple-shooting method. It is therefore of great
interest to make this algorithm as fast as possible, and since
vector computers are especially available in areas dealing
with the above-mentioned problems, the adaptation of the
algorithm to the new computer generation is impatiently
awaited by various users.

As  indicated  by  the  name,  vector  computers  were 
conceived  for  the  fast  solution  of  linear  problems.  But 
they  can  also  be  used  for  nonlinear  problems,  which  can 
be  solved  by  a  sequence  of  linear  problems.  Unfortunate¬ 
ly  the  multiple-shooting  method  is  not  of  that  type. 
Therefore  it  seems  to  be  very  difficult  to  take  advantage 
of vector computers in this case. Nevertheless we will
show that it is still possible.

"The  3-D  Linear  Hierarchical  Basis  Preconditioner" 

Maria  Elizabeth  G.  Ong,  University  of  Washington,  Seattle. 

The finite element discretization of a self-adjoint and
positive definite problem in two dimensions using linear
triangular elements and nodal basis functions generates
a coefficient matrix $A$ having a condition number $O(N)$,
where N is the number of unknowns. Yserentant (1986)
has shown that this can be improved to $O((\log N)^2)$ by
using the hierarchical basis functions. This improvement
can be viewed as preconditioning the linear system
associated with the nodal basis functions, i.e., $\hat{A} = S^T A S$,
where $\hat{A}$ is the hierarchical basis coefficient matrix, $A$ is
the nodal basis coefficient matrix and $S$ is a mesh-dependent
matrix. This preconditioning can be implemented
efficiently without forming the matrix $S$. A parallel
implementation of this preconditioner in 2D applied to a
Poisson problem with Dirichlet boundary conditions has
been done by Adams and Ong (1987) on the Flexible-32
parallel processor at the NASA Langley Research Center,
and by Greenbaum, Li, and Chao (1987) on the NYU
Ultracomputer Prototype. These papers show that this
preconditioner can be implemented more efficiently in
parallel than ICCG. This fact, coupled with its logarithmic
condition number and robustness, makes the method
very attractive.

In this paper, we study the performance of this
method for 3-D problems. First, we extend Yserentant's
results to three dimensions and show that the condition
number is $O(N^{1/3})$ as opposed to $O(N^{2/3})$ for the nodal
basis functions. The proof is constructed using linear
tetrahedral elements. A refinement procedure is chosen
such that a tetrahedron at level k is divided into eight
equivolume tetrahedrons at level k + 1. The number of nodes
with j levels of refinement is given by $(2^j + 1)^3 = N$.
Hierarchical basis functions are then defined at the nodes, and
the discrete solution is expressed as a linear combination
of these basis functions. With the aid of interpolating
polynomials defined at each level of refinement, the solution
can be decomposed into hierarchical basis components
at each level. Using this procedure and an
inequality we derived for a function defined on a sphere,
we confirm Yserentant's conjecture that the condition
number of the coefficient matrix $\hat{A}$ obtained with hierarchical
basis functions is bounded by $O(N^{1/3})$. We verify this
result by comparing iteration counts to solve the linear system
using the preconditioned conjugate gradient method.

Next, we discuss a strategy for implementing the 3-D
hierarchical basis preconditioner on parallel processors
with shared memory such as NASA Langley's Flexible-32.
For 2-D problems, multiplication by $\hat{A} = S^T A S$
can be efficiently implemented and operations can be
saved so that the work count of the preconditioned
conjugate gradient method is $O(N \log N)$. The corresponding
issue of the efficient implementation of multiplication by
$\hat{A} = S^T A S$ will be discussed for 3-D problems.

"A  New  Parallel  Algorithm  for  LU  Decomposition" 

He Zhang, Department of Mathematics, Temple University, Philadelphia,
Pennsylvania.

In this paper, we will develop a new parallel scheme
for factoring a given matrix A into its lower (L) and upper
(U) triangular factors. Based on this technique, a
near-optimal parallel algorithm for LU decomposition is
developed. Some of its important features are:

•  Its design is insensitive to the number of processors,
and  its  performance  grows  monotonically  with  them. 

•  It is especially good for large matrices, with dimensions
large relative to the number of processors in the system.
In this case, it achieves the optimal speed-up, optimal
efficiency, and very low communication
complexity.

•  It can be used in both distributed parallel computing
environments and tightly coupled parallel computing
systems.

•  This  algorithm  can  be  mapped  onto  any  parallel  archi¬ 
tecture  without  any  major  programing  difficulties  or  al¬ 
gorithmic  changes. 
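
For orientation, the factorization itself can be sketched with the
textbook serial Doolittle elimination below. This is not the new
parallel scheme of the paper, whose details the abstract does not give;
it uses no pivoting, and the function names `lu_decompose` and `matmul`
are hypothetical.

```python
def lu_decompose(a):
    """Return (L, U) with A = L U, L unit lower and U upper triangular.

    Textbook Doolittle elimination without pivoting; raises
    ZeroDivisionError if a zero pivot is encountered.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(v) for v in row] for row in a]
    for k in range(n):
        if upper[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        # The updates of rows k+1..n-1 are independent of one another,
        # which is the parallelism that row-oriented schemes exploit.
        for i in range(k + 1, n):
            m = upper[i][k] / upper[k][k]
            lower[i][k] = m
            for j in range(k, n):
                upper[i][j] -= m * upper[k][j]
    return lower, upper


def matmul(x, y):
    """Plain nested-list matrix product, used here only as a check."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


L, U = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```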